janlan

Reputation: 477

Match multiple line text (from 1 to n lines) until certain new line regex

I created a regex to match the following pattern:

<some text>
yyyy.MM.dd SOME TEXT decimal decimal
yyyy.MM.dd
some sentence
some sentence
some sentence (there can be from 1 to n comment lines, none of which starts with yyyy.MM.dd SOME TEXT decimal decimal)
yyyy.MM.dd SOME TEXT decimal decimal
yyyy.MM.dd
some sentence
some sentence
some sentence
...
<some text>

The regex:

((\d{4}\.\d{2}\.\d{2})\s([a-zA-Z\s]{0,})\s(\-{0,1}((\d{1}\,\d{2})|(\d{1,}\ \d{3}\,\d{2})))\s(\-{0,1}((\d{1}\,\d{2})|(\d{1,}\ \d{3}\,\d{2}))\s)(\d{4}\.\d{2}\.\d{2}))

It matches only the first 2 lines. I can't match the multiline sentences up to (but excluding) the next yyyy.MM.dd SOME TEXT decimal decimal line.

This is the test data for matching:

2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
...

It should match like this:

1.

2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30
Some text that should be
matched 20.01.2020 as
multiline text
until now

For me it matches like this:

1.

2020.11.01 SOME TEXT -17,30 83 016,86
2020.10.30
2020.11.01 SOME TEXT -27,30 81 016,86
2020.10.30

How can I match the following lines (1 or more), stopping before the next line that starts with 'yyyy.MM.dd SOME TEXT decimal decimal'?

Upvotes: 1

Views: 66

Answers (1)

The fourth bird

Reputation: 163207

For the example data, you can match the first 2 lines, each starting with a date-like pattern, followed by all the lines that do not start with a date-like pattern.

Note that \d{4}\.\d{2}\.\d{2} does not validate a date itself. To get a more precise match, this page has more detailed examples.

^\d{4}\.\d{2}\.\d{2} .*\r?\n\d{4}\.\d{2}\.\d{2}\b.*(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*

Regex demo
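As a rough illustration, the pattern above can be tried in Python against the test data from the question. This is only a sketch: it assumes the `re.MULTILINE` flag is enabled (so that `^` can match at the start of every line), and the names `PATTERN`, `SAMPLE` and `is_real_date` are made up for this example. The helper also shows the point from the note: the regex accepts any digits in date-like positions, while `datetime.strptime` rejects impossible dates.

```python
import re
from datetime import datetime

# Pattern from the answer; re.MULTILINE lets ^ match at the start of each line.
PATTERN = re.compile(
    r"^\d{4}\.\d{2}\.\d{2} .*\r?\n"            # line 1: date, space, rest of line
    r"\d{4}\.\d{2}\.\d{2}\b.*"                 # line 2: starts with a date
    r"(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*",  # lines that do not start with a date
    re.MULTILINE,
)

# Test data copied from the question (two records).
SAMPLE = "\n".join([
    "2020.11.01 SOME TEXT -17,30 83 016,86",
    "2020.10.30",
    "Some text that should be",
    "matched 20.01.2020 as",
    "multiline text",
    "until now",
    "2020.11.01 SOME TEXT -27,30 81 016,86",
    "2020.10.30",
    "Some text that should be",
    "matched 20.01.2020 as",
    "multiline text",
    "until now",
])


def is_real_date(text):
    """Return True if text is a valid calendar date in yyyy.MM.dd form."""
    try:
        datetime.strptime(text, "%Y.%m.%d")
        return True
    except ValueError:
        return False


blocks = [m.group(0) for m in PATTERN.finditer(SAMPLE)]
for number, block in enumerate(blocks, start=1):
    first_date = block[:10]  # the leading yyyy.MM.dd of the record
    print(f"--- match {number} (valid date: {is_real_date(first_date)}) ---")
    print(block)
```

Each match covers one full record. The line `matched 20.01.2020 as` stays inside its record because the lookahead only checks the start of a line.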

Or, if the first line can be followed by 1 or more lines that start with a date-like pattern, and then by lines that do not:

^\d{4}\.\d{2}\.\d{2} \S.*(?:\r?\n\d{4}\.\d{2}\.\d{2}\b.*)+(?:\r?\n(?!\d{4}\.\d{2}\.\d{2}\b).*)*

Explanation

  • ^ Start of the string (or start of each line when the multiline flag is enabled, which is needed to find every record)
  • \d{4}\.\d{2}\.\d{2} \S.* match a datelike pattern followed by a space, at least a non whitespace char (For SOME TEXT in the example) and the rest of the line
  • (?:\r?\n\d{4}\.\d{2}\.\d{2}\b.*)+ Repeat 1+ times, matching lines that start with a date-like pattern
  • (?: Non-capturing group (to repeat as a whole)
    • \r?\n Match a newline
    • (?!\d{4}\.\d{2}\.\d{2}\b) Assert that there is no date-like format directly to the right
    • .* If the previous assertion is true, match the whole line
  • )* Optionally repeat, matching all lines that do not start with a date-like pattern (if there should be at least 1 such line, change the quantifier to +)

Regex demo
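The parts listed in the explanation can also be written out with `re.VERBOSE`, one comment per part. This is a sketch, not the answer's own code: it assumes Python, the `re.MULTILINE` flag, and a named group `comment` that is not part of the original pattern and is added here only to pull out the trailing lines. In verbose mode a literal space has to be written as `[ ]`.

```python
import re

DATE = r"\d{4}\.\d{2}\.\d{2}"  # date-like pattern, does not validate the date

BLOCK = re.compile(
    rf"""
    ^{DATE}[ ]\S.*                  # date, a space, a non-whitespace char, rest of the line
    (?:\r?\n{DATE}\b.*)+            # 1+ following lines that start with a date-like pattern
    (?P<comment>                    # hypothetical named group, added for illustration
        (?:\r?\n(?!{DATE}\b).*)*    # 0+ lines that do not start with a date-like pattern
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

# Test data copied from the question (two records).
SAMPLE = "\n".join([
    "2020.11.01 SOME TEXT -17,30 83 016,86",
    "2020.10.30",
    "Some text that should be",
    "matched 20.01.2020 as",
    "multiline text",
    "until now",
    "2020.11.01 SOME TEXT -27,30 81 016,86",
    "2020.10.30",
    "Some text that should be",
    "matched 20.01.2020 as",
    "multiline text",
    "until now",
])

# Collect the free-text lines of every record, dropping the leading newline.
comments = [
    m.group("comment").lstrip("\r\n").splitlines()
    for m in BLOCK.finditer(SAMPLE)
]
for number, lines in enumerate(comments, start=1):
    print(number, lines)
```

Note that because the second part uses `+`, a record without any comment lines would let that part run on into the next record's header line, so this variant fits data where every record has at least one non-date line.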

Upvotes: 2
