Aymane Hrouch

Reputation: 103

JavaScript regex to split by lines starting with a pattern

My goal is to extract messages from an exported conversation that looks like this:

inputText = `3/24/18 - Username : message here
3/24/18 - anotherUser : another message`

What I tried

My naive approach was to split on every newline. I used arr = inputText.match(/[^\r\n]+/g) (source: JS regex to split by line), which works perfectly for single-line messages.

But now I'm facing a case I hadn't thought of earlier: a user sends a multi-line message, like:

inputText = `3/24/18 - Username : message here,
other text, same message
3/24/18 - anotherUser : another message`

My first naive approach produces the wrong output:

arr = ['3/24/18 - Username : message here,',
       'other text, same message',
       '3/24/18 - anotherUser : another message']

while I need it to be like this:

arr = ['3/24/18 - Username : message here, other text, same message',
       '3/24/18 - anotherUser : another message']

I need to split into lines, but only when a line starts with the pattern m/d/y - username :

Upvotes: 0

Views: 377

Answers (1)

trincot

Reputation: 350167

If your messages always start with a date, formatted as in your example, then you could match that. It is probably somewhat easier with split:

var inputText = `3/24/18 - Username : message here
message here too!!
3/24/18- anotherUser : another message`;

var result = inputText.split(/[\r\n]*(?=^\d+\/)/m).filter(Boolean);

console.log(result);

If you then want to replace the \r and \n with a space, add a map:

var inputText = `3/24/18 - Username : message here
message here too!!
3/24/18- anotherUser : another message`;

var result = inputText.split(/[\r\n]*(?=^\d+\/)/m).filter(Boolean)
    .map(text => text.replace(/[\r\n]+/g, " "));

console.log(result);

Explanation

The regular expression breaks down into the following parts:

  • [\r\n]*: any number of newline characters
  • (?= ): look-ahead to see whether pattern matches the next characters, without actually matching ("eating") them
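  • ^\d+\/: the pattern that denotes the start of a message: one or more digits followed by a forward slash, at the beginning of a line (the m flag makes ^ match at the start of every line, not only at the start of the input)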

Note that the regular expression will match the parts that should define the splits; those characters will not appear in the output. That is why the date-pattern is verified with look-ahead -- we don't want to lose those characters; they belong to the next line.
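The look-ahead only checks for digits followed by a slash, so a continuation line such as "10/10 would recommend" would also start a new message. As a hedged sketch, the look-ahead can be tightened to the full "m/d/y - username :" prefix from the question. The helper name splitMessages, the digit counts in the date, and the sample text are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: splits an exported conversation into messages.
// Assumes every message starts with "m/d/y - username :" (1-2 digit month
// and day, 2-4 digit year); adjust the pattern if the export differs.
function splitMessages(text) {
  // Consume the line break(s), then look ahead (without consuming) for the
  // full message prefix at the start of the next line.
  const boundary = /[\r\n]+(?=\d{1,2}\/\d{1,2}\/\d{2,4}\s*-\s*[^:\r\n]+:)/;
  return text
    .split(boundary)
    .filter(Boolean) // drop the empty string produced by leading blank lines
    .map(message => message.replace(/[\r\n]+/g, " ")); // join continuation lines
}

// Assumed sample: the second line starts with digits and a slash,
// but it is not a full message prefix, so it stays with the first message.
const sample = `3/24/18 - Username : message here,
10/10 would recommend
3/24/18 - anotherUser : another message`;

const messages = splitMessages(sample);
console.log(messages);
// [ '3/24/18 - Username : message here, 10/10 would recommend',
//   '3/24/18 - anotherUser : another message' ]
```

A continuation line that happens to look exactly like a message prefix would still be split off; no pattern can tell those apart from the text alone.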

If the input begins with line breaks, the split at the very first message generates an empty string (for what precedes the split): this should be ignored. That is what .filter(Boolean) does. As empty strings are falsy, they are left out by this filter.
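To see the filter at work, here is a small sketch on an input that begins with a line break. The sample string and the variable names are assumptions made for illustration.

```javascript
// Same split pattern as in the answer.
const pattern = /[\r\n]*(?=^\d+\/)/m;

// Assumed sample input that starts with a blank line.
const withLeadingBlankLine =
  "\n3/24/18 - Username : hi\n3/24/18 - anotherUser : hello";

// Without the filter, the text before the first split point (nothing)
// shows up as an empty string.
const raw = withLeadingBlankLine.split(pattern);
console.log(raw);
// [ '', '3/24/18 - Username : hi', '3/24/18 - anotherUser : hello' ]

// Boolean('') is false, so filter(Boolean) drops the empty entry.
const cleaned = raw.filter(Boolean);
console.log(cleaned);
// [ '3/24/18 - Username : hi', '3/24/18 - anotherUser : hello' ]
```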

Upvotes: 3
