irom

Reputation: 3596

regular expression to capture groups in selected lines

I have the multi-line string below (in Python) and am looking for a regex to extract src, dst and severity. So in the example below, group 1 would be '10.4.180.25', group 2 '34.23.21.10' and group 3 'critical'.

    src: 10.4.180.25
    dst: 34.23.21.10
    natsrc: 20.160.129.5
    natdst: 34.33.21.10
... more lines
    severity: critical
... more lines

If I try a regex like /src: (\b\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\b)\ndst: (\b\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\b)\n/ with the gm flags, it finds src and dst but not severity, which is a few lines further down (lines omitted for clarity). Is there a way to do this without spelling out all of the lines between src, dst and severity?

Upvotes: 4

Views: 2268

Answers (2)

Joe Iddon

Reputation: 20414

You can use a lazy (non-greedy) quantifier in the regex to do this:

src: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\ndst: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})[\s\S]*?severity: (.+)?\n

I have updated the regex so it actually works now!

It starts by searching for the same part you already have, but since there are many lines between the dst: line and the severity: line, we need to skip over all of them.

To match any number of lines up to the line beginning with severity:, we need to match any character, including newlines. To do this, we can use a character set: [\s\S]. This means match any character that is either whitespace or not whitespace, i.e. every character. We then apply the lazy quantifier *? to it so that it matches only as many characters as are needed to reach the severity: line, which gives [\s\S]*?severity:.
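To see why the lazy form `*?` matters here, the following minimal sketch compares it with the plain greedy `*`. The sample text is hypothetical (it is not from the question) and deliberately contains two severity: lines so that the two quantifiers give different results.

```python
import re

# Hypothetical text with two "severity:" lines, only to show the difference.
text = (
    "dst: 34.23.21.10\n"
    "severity: critical\n"
    "note: something else\n"
    "severity: low\n"
)

# Lazy: [\s\S]*? consumes as little as possible, so it stops at the
# first "severity:" after "dst:".
lazy = re.search(r"dst:[\s\S]*?severity: (.+)", text)

# Greedy: [\s\S]* consumes as much as possible and backtracks, so it
# ends up at the last "severity:" in the text.
greedy = re.search(r"dst:[\s\S]*severity: (.+)", text)

print(lazy.group(1))    # critical
print(greedy.group(1))  # low
```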

Now that we are at the severity: line, we want to match and return the characters up to the end of that line (up to the newline character \n). This is done with (.+)?\n, where the plus matches one or more characters. Since we want this part returned as a group, it is wrapped in parentheses.
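Put together in Python, this approach might look like the sketch below. The sample text is modelled on the question, and the app: and action: lines are made-up stand-ins for the omitted "... more lines". The dots in the IP part are escaped here so that they match only literal dots, and the helper name `ip` is just for illustration.

```python
import re

# Sample modelled on the question; "app" and "action" are hypothetical
# placeholders for the omitted lines.
text = (
    "src: 10.4.180.25\n"
    "dst: 34.23.21.10\n"
    "natsrc: 20.160.129.5\n"
    "natdst: 34.33.21.10\n"
    "app: web-browsing\n"
    "severity: critical\n"
    "action: alert\n"
)

# IP-like fragment: four groups of 1-3 digits separated by literal dots.
ip = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# [\s\S]*? lazily skips everything (newlines included) up to "severity: ".
pattern = re.compile(rf"src: ({ip})\ndst: ({ip})[\s\S]*?severity: (.+)\n")

match = pattern.search(text)
if match:
    src, dst, severity = match.groups()
    print(src, dst, severity)  # 10.4.180.25 34.23.21.10 critical
```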

Upvotes: 2

Wiktor Stribiżew

Reputation: 626845

You need to actually match any number of lines that do not start with severity after what your pattern matches. Besides, you can shorten the pattern by using the {3} limiting quantifier so that \.\d{1,3} is not repeated so many times. Note that between a whitespace and a digit the word boundary is implicit; it is already there, so there is no need to use \b.

Use

src:\s*(\d{1,3}(?:\.\d{1,3}){3})\ndst:\s*(\d{1,3}(?:\.\d{1,3}){3})(?:\n(?!severity).+)*?\nseverity:\s*(.+)

See the regex demo

Details

  • src: - a literal substring
  • \s* - 0+ whitespaces
  • (\d{1,3}(?:\.\d{1,3}){3}) - Group 1: IP-like pattern
  • \n - a newline
  • dst:\s* - dst: with 0+ whitespaces after it
  • (\d{1,3}(?:\.\d{1,3}){3}) - Group 2: IP-like pattern
  • (?:\n(?!severity).+)*? - 0+ sequences (as few as possible) of
    • \n(?!severity) - a newline not followed with severity
    • .+ - the whole line
  • \nseverity:\s* - a newline, severity: substring and 0+ whitespaces
  • (.+) - Group 3: 1 or more chars up to the end of the line

Note you do not need any DOTALL modifier with this regex.
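As a rough sketch of how this pattern could be used from Python, the example below runs it with `re.finditer` over a text holding two records. The second record and the app: and action: lines are hypothetical and only illustrate that each match stops at its own severity: line. No flags are passed, in line with the note about DOTALL.

```python
import re

# First record is modelled on the question; the second record and the
# "app"/"action" lines are hypothetical filler.
text = (
    "src: 10.4.180.25\n"
    "dst: 34.23.21.10\n"
    "natsrc: 20.160.129.5\n"
    "natdst: 34.33.21.10\n"
    "app: web-browsing\n"
    "severity: critical\n"
    "action: alert\n"
    "src: 10.4.180.26\n"
    "dst: 34.23.21.11\n"
    "natsrc: 20.160.129.6\n"
    "severity: low\n"
)

pattern = re.compile(
    r"src:\s*(\d{1,3}(?:\.\d{1,3}){3})\n"   # Group 1: src IP
    r"dst:\s*(\d{1,3}(?:\.\d{1,3}){3})"     # Group 2: dst IP
    r"(?:\n(?!severity).+)*?"               # skip lines not starting with severity
    r"\nseverity:\s*(.+)"                   # Group 3: rest of the severity line
)

records = [m.groups() for m in pattern.finditer(text)]
for src, dst, severity in records:
    print(src, dst, severity)
```

Because `.+` needs at least one character, a completely blank line between dst: and severity: would stop this pattern from matching; that is a property of the pattern as written, not something the question's sample contains.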

Upvotes: 3
