Aditya

Reputation: 571

Regex to match a pattern until its next occurrence

I have the following data:

2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
  'x':1,
  'y':2,
  'z':3,
  'w':4
}
2018-03-20 23:28:47 INFO This is an info sample(can be multiline with new line characters)
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
  'a':5,
  'b':6,
  'c':7,
  'd':8
}

I need to extract all DEBUG statements, and for that I am using this regex: (\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2}\ DEBUG(.|\n|\r)*?)(?=\d{4}\-\d{2}\-\d{2}\ \d{2}\:\d{2}\:\d{2}) but it omits the last DEBUG statement. What regex would produce the following output?

2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters) {
  'x':1,
  'y':2,
  'z':3,
  'w':4
}
2018-03-20 23:28:47 DEBUG This is a debug sample(can be multiline with new line characters){
  'a':5,
  'b':6,
  'c':7,
  'd':8
}

Upvotes: 2

Views: 609

Answers (2)

Gsk

Reputation: 2945

If you are sure that all the paragraphs with DEBUG will end with }, you can use:

r"(.*DEBUG[\s\S]*?\})"

If a DEBUG entry may or may not contain {}, the following regex should do the trick:

r"(.*DEBUG.*(?!=\{|\n))(\{[\s\S]*?\})?"
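
As a rough illustration of the first pattern, here is a minimal sketch assuming Python's `re` module (the `r"..."` literals hint at it, but that is an assumption). The `log` string is a shortened, made-up stand-in for the sample in the question, and `debug_blocks` is just an illustrative name:

```python
import re

# Shortened stand-in for the log in the question (illustrative only).
log = (
    "2018-03-20 23:28:47 INFO This is an info sample\n"
    "2018-03-20 23:28:47 DEBUG This is a debug sample {\n"
    "  'x':1,\n"
    "  'y':2\n"
    "}\n"
    "2018-03-20 23:28:47 INFO This is an info sample\n"
    "2018-03-20 23:28:47 DEBUG This is a debug sample{\n"
    "  'a':5,\n"
    "  'b':6\n"
    "}\n"
)

# Without DOTALL, .* stays on one line, so each match begins at the start
# of a DEBUG line; [\s\S]*? then runs lazily up to the first "}".
debug_blocks = re.findall(r"(.*DEBUG[\s\S]*?\})", log)

for block in debug_blocks:
    print(block)
    print("---")
```

Because the match stops at the first `}`, a DEBUG message whose body contains nested braces would be cut short.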

Upvotes: 1

Wiktor Stribiżew

Reputation: 626689

I suggest:

  • Anchor the matches at the start of a line to make them safer (by using ^ together with (?m))
  • Fix the current issue by adding an alternative that matches the very end of the string, \Z (same as Ken suggests in the comments)
  • Replace the very inefficient (.|\n|\r)*? pattern with .*? and add the DOTALL modifier (?s)
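
To see the second point in isolation, here is a small sketch, assuming Python's `re` module. The `log` sample, the `TS` helper and the variable names are made up for illustration, and the question's pattern is reproduced with its redundant escapes removed:

```python
import re

# Tiny illustrative log; the last DEBUG entry has no timestamp after it.
log = (
    "2018-03-20 23:28:47 INFO first\n"
    "2018-03-20 23:28:47 DEBUG second {\n"
    "  'x':1\n"
    "}\n"
    "2018-03-20 23:28:47 INFO third\n"
    "2018-03-20 23:28:47 DEBUG fourth {\n"
    "  'a':5\n"
    "}"
)

# Hypothetical helper: the timestamp sub-pattern, reused below.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# Pattern from the question: the lookahead insists on another timestamp.
original = re.compile(r"(" + TS + r" DEBUG(.|\n|\r)*?)(?=" + TS + r")")
# Same pattern, except the lookahead also accepts the end of the string.
with_end = re.compile(r"(" + TS + r" DEBUG(.|\n|\r)*?)(?=" + TS + r"|\Z)")

original_hits = [m.group(1) for m in original.finditer(log)]
with_end_hits = [m.group(1) for m in with_end.finditer(log)]

print(len(original_hits))   # the final DEBUG entry is missed
print(len(with_end_hits))   # both DEBUG entries are found
```

The last entry is lost in the first case simply because nothing after it can satisfy the timestamp lookahead.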

The whole fix will look like

(?sm)^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\s*(.*?)(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)


Details

  • (?sm) - DOTALL and MULTILINE options on
  • ^ - start of a line
  • \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - a timestamp like pattern
  • DEBUG - a literal substring
  • \s* - zero or more whitespace characters
  • (.*?) - Group 1: any 0+ chars, as few as possible, up to but excluding
  • (?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z) - a positive lookahead that requires either
    • [\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - one or more CR or LF symbol(s) followed by a timestamp-like pattern
    • | - or
    • \Z - the very end of the string
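
A minimal usage sketch of the full pattern, assuming Python's `re` module (where \Z matches only at the very end of the string). The `log` sample is a shortened, made-up copy of the question's data, and `entries` and `bodies` are illustrative names:

```python
import re

# Shortened stand-in for the log in the question (illustrative only).
log = (
    "2018-03-20 23:28:47 INFO This is an info sample\n"
    "2018-03-20 23:28:47 DEBUG This is a debug sample {\n"
    "  'x':1,\n"
    "  'y':2\n"
    "}\n"
    "2018-03-20 23:28:47 INFO This is an info sample\n"
    "2018-03-20 23:28:47 DEBUG This is a debug sample{\n"
    "  'a':5,\n"
    "  'b':6\n"
    "}\n"
)

pattern = re.compile(
    r"(?sm)^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\s*(.*?)"
    r"(?=[\r\n]+\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}|\Z)"
)

entries = []  # whole DEBUG entries, timestamp included (group 0)
bodies = []   # only the text after "DEBUG" (group 1)
for m in pattern.finditer(log):
    # rstrip() drops the trailing newline that the last match picks up
    # when the log ends with a line break.
    entries.append(m.group(0).rstrip())
    bodies.append(m.group(1).rstrip())

for entry in entries:
    print(entry)
```

If the whole entry is wanted, as in the question's expected output, the full match (group 0) is the one to use; Group 1 only holds the message text.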

Upvotes: 3
