Anish

Reputation: 758

How to match multi-line text using the right regex capture group?

I'm trying to read a CSV file and split each row using regex capture groups. The last column contains newline characters, and my regex's second capture group stops at the first newline instead of capturing the rest of the field.

Below is what I have so far. Each record starts with ABC-, so I put that in my first capture group; everything after it, up to the next occurrence of ABC- or the end of the file (for the last record), should be captured by the second capture group. The first row works as expected because it contains no newline characters, but the other rows do not.

My regex: ([A-Z1-9]+)-\d*,(.*)

My test string:

ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM

MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM
MULTI LINE TEXT",

Expected result is:

3 matches

Match 1:

Group 1: ABC-1,

Group 2: 01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",

Match 2:

Group 1: ABC-2,

Group 2: 01/01/1974,X2,Y2,Z2,"THIS IS

A RANDOM

MULTI LINE

TEXT 2",

Match 3:

Group 1: ABC-3,

Group 2: 01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM

MULTI LINE TEXT",


Upvotes: 1

Views: 186

Answers (2)

Wiktor Stribiżew

Reputation: 626835

You can use

^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)

See the regex demo. Use it with the multiline flag so that ^ matches at the start of every line (Ruby needs no flag, since ^ already matches at line starts there).

Details:

  • ^ - start of a line
  • ([A-Z]+-\d+) - Group 1: one or more uppercase ASCII letters and then - and one or more digits
  • , - a comma
  • (.*(?:\n(?![A-Z]+-\d+,).*)*) - Group 2:
    • .* - the rest of the line
    • (?:\n(?![A-Z]+-\d+,).*)* - zero or more lines that do not start with one or more uppercase ASCII letters and then - and one or more digits + a comma
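The pattern above can be tried out with a short script. This is a minimal sketch in Python, which the question does not specify, so the language choice and the names TEXT and RECORD are assumptions for illustration. The sample data is copied from the question, without a trailing newline.

```python
import re

# Sample data from the question (no trailing newline after the last record).
TEXT = '''ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM

MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM
MULTI LINE TEXT",'''

# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the string. The dot does not need to match newlines here because
# the pattern consumes each \n explicitly.
RECORD = re.compile(
    r'^([A-Z]+-\d+),(.*(?:\n(?![A-Z]+-\d+,).*)*)',
    re.MULTILINE,
)

# findall returns one (group 1, group 2) tuple per record.
records = RECORD.findall(TEXT)

for key, rest in records:
    print(key, repr(rest))
```

Note that any continuation line that itself begins with something like ABC-9, would be treated as the start of a new record, so this relies on the record prefix never appearing at the start of a line inside the quoted text.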

Upvotes: 1

Alex Sveshnikov

Reputation: 4329

You can limit the second group with a lookahead assertion (this needs the flags that make . match newlines and ^ match at line starts):

(ABC-\d+,)(.*?(?=^ABC|\z))

Demo here.
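As a rough sketch of how the lookahead approach behaves, here it is in Python (an assumed language, since the question names none). Two adaptations are assumptions on my part: \z is written as \Z, which is the end-of-string anchor that Python's re module has long supported, and the DOTALL and MULTILINE flags are set explicitly. The names TEXT and RECORD are illustrative.

```python
import re

# Sample data from the question (no trailing newline after the last record).
TEXT = '''ABC-1,01/01/1974,X1,Y1,Z1,"RANDOM SINLGE LINE TEXT 1",
ABC-2,01/01/1974,X2,Y2,Z2,"THIS IS
A RANDOM

MULTI LINE
TEXT 2",
ABC-3,01/01/1974,X3,Y3,Z3,"THIS IS

ANOTHER RANDOM
MULTI LINE TEXT",'''

# DOTALL lets . cross newlines; MULTILINE lets ^ inside the lookahead match
# at the start of any line. The lazy .*? stops at the first position where
# the next line starts with ABC, or at the very end of the string (\Z).
RECORD = re.compile(r'(ABC-\d+,)(.*?(?=^ABC|\Z))', re.DOTALL | re.MULTILINE)

matches = RECORD.findall(TEXT)

for key, rest in matches:
    print(key, repr(rest))
```

With this pattern, group 1 includes the trailing comma, and group 2 keeps the newline that precedes the next record, so callers may want to strip it.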

Upvotes: 0
