Reputation: 25820
What would be the most efficient (fast and reliable enough) way in JavaScript to determine the type of line breaks used in a text: Unix or Windows?
In my Node app I have to read in large utf-8 text files and then process them based on whether they use Unix or Windows line breaks.
When the type of line break is uncertain (for example, when both kinds appear), I want to settle on whichever one is most likely.
UPDATE
See my own answer below for the code I ended up using.
Upvotes: 8
Views: 8056
Reputation: 16928
Thanks, @Sam-Graham. I tried to produce an optimized version. The function's return value is also directly usable (see the usage example below):
function getLineBreakChar(string) {
    const indexOfLF = string.indexOf('\n', 1) // No need to check first-character
    if (indexOfLF === -1) {
        if (string.indexOf('\r') !== -1) return '\r'
        return '\n'
    }
    if (string[indexOfLF - 1] === '\r') return '\r\n'
    return '\n'
}
Note 1: This assumes string is healthy (contains only one type of line break).
Note 2: This assumes you want LF to be the default (when no line break is found).
Usage example:
fs.writeFileSync(filePath,
    string.substring(0, a) +
    getLineBreakChar(string) +
    string.substring(b)
);
This utility may be useful too:
const getLineBreakName = (lineBreakChar) =>
    lineBreakChar === '\n' ? 'LF' : lineBreakChar === '\r' ? 'CR' : 'CRLF'
Upvotes: 5
Reputation: 25820
In the end I used my own solution for this, based on simple statistics:
const {EOL} = require('os');

function getEOL(text) {
    const m = text.match(/\r\n|\n/g); // all line breaks, or null if there are none
    const u = m && m.filter(a => a === '\n').length; // Unix (LF) count
    const w = m && m.length - u; // Windows (CRLF) count
    if (u === w) {
        return EOL; // tie or no line breaks: use the OS default
    }
    return u > w ? '\n' : '\r\n';
}
When there are no line breaks, or when both kinds occur equally often, it returns the OS default EOL.
UPDATE
Later on I found through further practice that if you want to process text the same way regardless of whether it uses Unix or Windows line breaks, the most efficient approach is to simply replace every Windows line break with the Unix one and skip detection altogether:
text = text.replace(/\r\n/g, '\n'); // replace every \r\n with \n
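A minimal sketch of that normalize-first approach, assuming the whole file is already in memory as a string. The names normalizeEol and countNonEmptyLines are hypothetical, and folding a lone \r into \n as well is an extra assumption that goes beyond the one-liner above:

```javascript
// Hypothetical helper: fold Windows (\r\n) and, as an extra assumption,
// lone CR (\r) line breaks into Unix (\n).
function normalizeEol(text) {
  return text.replace(/\r\n?/g, '\n');
}

// After normalizing, the text can be split on a single character.
function countNonEmptyLines(text) {
  return normalizeEol(text)
    .split('\n')
    .filter(line => line.length > 0)
    .length;
}

console.log(countNonEmptyLines('a\r\nb\nc\r\n')); // 3
```

If the original line-break style has to be restored when writing the file back, it still needs to be detected before normalizing.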
Upvotes: 2
Reputation: 1360
Look for the first LF with source.indexOf('\n'), then check whether the character just before it is a CR: source[source.indexOf('\n') - 1] === '\r'. This way, you only find the first line break and classify the text by it. In summary:
function whichLineEnding(source) {
    var temp = source.indexOf('\n');
    if (source[temp - 1] === '\r')
        return 'CRLF'
    return 'LF'
}
There are two fairly popular npm modules that do this: node-newline and crlf-helper. The first does a split on the entire string, which is very inefficient in your case. The second uses a regex, which in your case would not be quick enough.
However, going by your edit, if you want to determine which type is more plentiful, then I would use the code from node-newline, as it does handle that case.
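As a rough sketch of the "which is more plentiful" case without splitting the string or using a regex, the loop below counts both kinds in a single pass with indexOf. This is not node-newline's code; the names countLineBreaks and dominantLineEnding and the fallback parameter are hypothetical:

```javascript
// Hypothetical helper: count LF and CRLF breaks in one pass using indexOf,
// without building an array of matches or lines.
function countLineBreaks(text) {
  let lf = 0;
  let crlf = 0;
  let i = text.indexOf('\n');
  while (i !== -1) {
    if (i > 0 && text[i - 1] === '\r') crlf++;
    else lf++;
    i = text.indexOf('\n', i + 1);
  }
  return { lf, crlf };
}

// Returns 'CRLF' or 'LF' for the majority. A tie (including no line breaks
// at all) returns the fallback, which is an assumption of this sketch.
function dominantLineEnding(text, fallback = 'LF') {
  const { lf, crlf } = countLineBreaks(text);
  if (lf === crlf) return fallback;
  return crlf > lf ? 'CRLF' : 'LF';
}

console.log(dominantLineEnding('a\r\nb\r\nc\nd')); // CRLF
```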
Upvotes: 7
Reputation: 13538
This is how we detect line endings in JavaScript files using the ESLint linebreak-style rule. Here, source means the actual file content.
Note: Files can sometimes contain mixed line endings as well.
https://github.com/eslint/eslint/blob/master/lib/rules/linebreak-style.js
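The linked rule is not reproduced here. As a rough illustration of the mixed-line-endings case only, the sketch below reports which styles occur in a string. It is not ESLint's implementation, and lineBreakStyles and hasMixedLineEndings are hypothetical names:

```javascript
// Hypothetical helper, not ESLint's code: reports which line-break styles
// ('LF', 'CRLF') occur in the text, in order of first appearance.
function lineBreakStyles(source) {
  const styles = new Set();
  for (const match of source.match(/\r\n|\n/g) || []) {
    styles.add(match === '\r\n' ? 'CRLF' : 'LF');
  }
  return [...styles];
}

// True when the text contains both LF and CRLF line breaks.
const hasMixedLineEndings = source => lineBreakStyles(source).length > 1;

console.log(hasMixedLineEndings('a\nb\r\nc')); // true
```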
Upvotes: 1
Reputation: 56
Try this:
// Test for \r\n first; a bare \r check would also match old Mac (CR-only) files.
if (/\r\n/.test(text)) {
    console.log('Windows');
} else if (/\n/.test(text)) {
    console.log('Unix');
} else {
    console.log('No line breaks found');
}
Upvotes: 1