vitaly-t

Reputation: 25820

Detecting type of line breaks

What would be the most efficient (fast and reliable enough) way in JavaScript to determine the type of line breaks used in a text: Unix vs Windows?

In my Node app I have to read in large utf-8 text files and then process them based on whether they use Unix or Windows line breaks.

When the type of line break is uncertain, I want to decide based on which one is most likely.

UPDATE

See my own answer below for the code I ended up using.

Upvotes: 8

Views: 8056

Answers (5)

Mir-Ismaili

Reputation: 16928

Thanks to @Sam-Graham. I tried to produce an optimized version. Also, the output of the function is directly usable (see the example below):

function getLineBreakChar(string) {
    const indexOfLF = string.indexOf('\n', 1)  // No need to check first-character
    
    if (indexOfLF === -1) {
        if (string.indexOf('\r') !== -1) return '\r'
        
        return '\n'
    }
    
    if (string[indexOfLF - 1] === '\r') return '\r\n'
    
    return '\n'
}

Note 1: This assumes the string is healthy (it contains only one type of line break).

Note 2: This assumes you want LF as the default (when no line break is found).


Usage example:

fs.writeFileSync(filePath,
        string.substring(0, a) +
        getLineBreakChar(string) +
        string.substring(b)
);

This utility may be useful too:

const getLineBreakName = (lineBreakChar) =>
    lineBreakChar === '\n' ? 'LF' : lineBreakChar === '\r' ? 'CR' : 'CRLF'

Upvotes: 5

vitaly-t

Reputation: 25820

In the end I used my own solution for this, based on simple statistics:

const {EOL} = require('os');

function getEOL(text) {
    const m = text.match(/\r\n|\n/g);
    const u = m && m.filter(a => a === '\n').length;
    const w = m && m.length - u;
    if (u === w) {
        return EOL; // use the OS default
    }
    return u > w ? '\n' : '\r\n';
}

When there are no line breaks, or when both types occur equally often, it returns the OS's default EOL.
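For very large inputs, the same tally can be gathered without building an array of every match. The following is only a minimal sketch of that idea, not the code used above; `countLineBreaks` is a hypothetical helper name, and it walks the string with `indexOf` instead of a regex:

```javascript
// Hypothetical helper: tallies LF-only and CRLF line breaks in a single pass.
function countLineBreaks(text) {
    let lf = 0;
    let crlf = 0;
    let i = text.indexOf('\n');
    while (i !== -1) {
        // An LF preceded by CR is a Windows line break; otherwise it is a Unix one
        if (i > 0 && text[i - 1] === '\r') {
            crlf++;
        } else {
            lf++;
        }
        i = text.indexOf('\n', i + 1);
    }
    return {lf, crlf};
}

console.log(countLineBreaks('a\r\nb\nc\r\n')); // { lf: 1, crlf: 2 }
```

The two counts can then be compared exactly as `u` and `w` are compared in `getEOL`.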

UPDATE

Later on I found through further practice that if you want to process text in the same way, regardless of whether it uses Unix or Windows line breaks, the most efficient approach is to simply replace every Windows line break with the Unix one, and not bother with any detection at all:

text = text.replace(/\r\n/g, '\n'); // replace every \r\n with \n
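If the input might also contain lone CR characters (old Mac-style line breaks), which is an assumption beyond the Unix vs Windows case in the question, a slightly broader pattern covers all three. A minimal sketch, with `normalizeLineBreaks` as a hypothetical name:

```javascript
// Hypothetical helper: converts CRLF and lone CR line breaks to LF.
function normalizeLineBreaks(text) {
    // \r\n? consumes a full CRLF when present, otherwise a lone CR
    return text.replace(/\r\n?/g, '\n');
}

console.log(JSON.stringify(normalizeLineBreaks('a\r\nb\rc\nd'))); // "a\nb\nc\nd"
```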

Upvotes: 2

Sam-Graham

Reputation: 1360

You would want to look first for an LF, with source.indexOf('\n'), and then check whether the character before it is a CR, with source[source.indexOf('\n') - 1] === '\r'. This way, you only find the first line break and classify the text by it. In summary:

function whichLineEnding(source) {
     var temp = source.indexOf('\n');
     if (source[temp - 1] === '\r')
         return 'CRLF'
     return 'LF'
}

There are two fairly popular npm modules that do this: node-newline and crlf-helper. The first does a split on the entire string, which is very inefficient in your case. The second uses a regex, which in your case would not be quick enough.

However, judging from your edit, if you want to determine which type is more plentiful, then I would use the code from node-newline, as it does handle that case.
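One way to combine the two ideas, stopping early and picking the more plentiful type, is to tally line breaks in only the first part of the text. This is a sketch under assumptions: `detectFromSample` and the 64 KiB default are illustrative choices, not taken from node-newline or crlf-helper, and a text whose line endings change after the sample would be misjudged.

```javascript
// Hypothetical helper: decides between 'LF' and 'CRLF' from a prefix of the text.
// Returns null when the sample has no line breaks or the counts are tied.
function detectFromSample(source, sampleSize = 64 * 1024) {
    const sample = source.slice(0, sampleSize);
    let lf = 0;
    let crlf = 0;
    for (let i = sample.indexOf('\n'); i !== -1; i = sample.indexOf('\n', i + 1)) {
        if (i > 0 && sample[i - 1] === '\r') {
            crlf++;
        } else {
            lf++;
        }
    }
    if (lf === crlf) {
        return null; // undecided: let the caller pick a default
    }
    return crlf > lf ? 'CRLF' : 'LF';
}

console.log(detectFromSample('a\r\nb\r\nc\n')); // CRLF
```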

Upvotes: 7

Gyandeep

Reputation: 13538

This is how we detect line endings in JavaScript files using an ESLint rule. Here, source means the actual file content.

Note: Sometimes files can also have mixed line endings.

https://github.com/eslint/eslint/blob/master/lib/rules/linebreak-style.js
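As a rough illustration of checking for mixed line endings (a minimal sketch, not the code of the linked rule), the following scans a source string and reports the 1-based numbers of lines whose line break differs from the expected one. `findUnexpectedLineBreaks` is a hypothetical name.

```javascript
// Hypothetical helper: returns the 1-based numbers of lines that end with
// a line break other than `expected` ('\n' or '\r\n').
function findUnexpectedLineBreaks(source, expected = '\n') {
    // CRLF is listed first so it is matched as one line break, not two;
    // a lone CR is also treated as a line break and is always reported
    const pattern = /\r\n|\r|\n/g;
    const lines = [];
    let line = 1;
    let match;
    while ((match = pattern.exec(source)) !== null) {
        if (match[0] !== expected) {
            lines.push(line);
        }
        line++;
    }
    return lines;
}

console.log(findUnexpectedLineBreaks('a\nb\r\nc\n')); // [ 2 ]
```

An empty result means every line break in the source matches the expected style.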

Upvotes: 1

nigelheap

Reputation: 56

Try this

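if (text.search(/\r\n/) > -1) {
   // CRLF found: Windows line breaks
   console.log('Windows');
} else if (text.search(/\n/) > -1) {
   // LF without a preceding CR anywhere in the text: Unix line breaks
   console.log('Unix');
} else {
   console.log('No line breaks found');
}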

Upvotes: 1
