Martin Brooker

Reputation: 75

Regex multiline string individual matching groups

Before I start: I know there are already many threads on this topic, but I couldn't find anything specific to my use case, so I would like to ask this question with a practical example based on what I found.

I have a multiline string that contains the content I'm interested in:

hello: "
  -foo=value1
  -bar=value2
  -baz=\"value3\"
"

I would like a separate match group for each piece of content within the multiline string, so that every entry inside the string that starts with - is matched. The conditions are:

I came up with this pattern:

hello:\s*"\s*-((?:[^"]|\n)*)"

Unfortunately, it doesn't handle escaped quotes, and it captures everything as a single group instead of matching each entry. I need the matched group to repeat for each line within that string.
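To make the problem concrete, here is a minimal sketch that runs the pattern above with Python's built-in re module. It assumes the input holds literal backslashes before the inner quotes; the variable names are only illustrative.

```python
import re

# Sample input from the question, assuming literal backslashes before the inner quotes.
text = 'hello: "\n  -foo=value1\n  -bar=value2\n  -baz=\\"value3\\"\n"'

# The pattern from the question, unchanged.
pattern = r'hello:\s*"\s*-((?:[^"]|\n)*)"'

match = re.search(pattern, text)
captured = match.group(1)

# A single group swallows all three entries and stops at the first escaped quote.
print(repr(captured))
```

The one capture runs from foo=value1 up to the backslash in front of the first escaped quote, which shows both issues at once: no per-line groups, and no handling of \".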

Upvotes: 1

Views: 103

Answers (2)

Pubudu Sitinamaluwa

Reputation: 978

I tried a recursive approach to this problem. Here is my solution.

I'm using this as my input text block, which I think covers all the possibilities:

source_text = """
hello: "
  -foo=value1
  -bar=value2
  -baz=\"value3\"
"
bla: "
  -foo=value1
  -bar=value2
  -baz=\"value3\"
"
hello: "
  -foo=value1
  -bar=value2
  -baz=\"value3\"
"
hello: -baz=\"value3\"
hello: "-foo=value1"
"""

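Recursively match the regex hello:\s("|\\")?\n?(\s*-\w+=\\?"?\w+\\?"?\n?)+ which captures the required text in match group 2. A repeated group keeps only its last repetition, so each call removes that repetition from the full match and recurses on what is left: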

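import re

all_matches = []

def get_all_matches(text):
    # Group 2 only holds the last "-key=value" repetition of each block.
    for block_match in re.finditer(r'hello:\s("|\\")?\n?(\s*-\w+=\\?"?\w+\\?"?\n?)+', text):
        match = block_match.group(2)
        if match:
            # Store the entry with all whitespace (including newlines) removed.
            all_matches.append(re.sub(r'\s', '', match))
            # Drop that entry from the block and recurse to reach the earlier ones.
            full_match = block_match.group(0)
            get_all_matches(full_match.replace(match, ''))

get_all_matches(source_text)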

Output

print(all_matches)
['-baz="value3"', '-bar=value2', '-foo=value1', '-baz="value3"', '-bar=value2', '-foo=value1', '-baz="value3"', '-foo=value1"']

Upvotes: 1

Wiktor Stribiżew

Reputation: 626689

You can use

hello:\s*"(?:\s*-((?:\\.|[^\n"\\])+))+\s*"

See this regex demo at the .NET regex tester. Details:

  • hello:\s*" - hello:, zero or more whitespaces, "
  • (?:\s*-((?:\\.|[^\n"\\])+))+ - one or more occurrences of:
    • \s*- - zero or more whitespace chars and then a -
    • ((?:\\.|[^\n"\\])+) - Group 1: one or more occurrences of any escaped char or any char but \, " and a newline
  • \s*" - zero or more whitespace and then ".
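As a quick check of what the repeated Group 1 holds, here is a small sketch using Python's built-in re module, which does not keep a capture stack. The sample string and variable names are assumptions for illustration.

```python
import re

# Sample input, assuming literal backslashes before the inner quotes.
text = 'hello: "\n  -foo=value1\n  -bar=value2\n  -baz=\\"value3\\"\n"'

pattern = r'hello:\s*"(?:\s*-((?:\\.|[^\n"\\])+))+\s*"'

match = re.search(pattern, text)

# The whole block matches, but group 1 only retains its final repetition.
last_capture = match.group(1)
print(last_capture)
```

With an engine like this, only the last entry is reachable through Group 1, so reading every repetition needs an engine that exposes all captures of a group.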

See this C# demo:

using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = "hello: \"\n  -foo=value1\n  -bar=value2\n  -baz=\\\"value3\\\"\n\"";
var rx = @"hello:\s*""(?:\s*-((?:\\.|[^\n""\\])+))+\s*""";
var result = Regex.Match(s, rx)?.Groups[1].Captures.Cast<Capture>().Select(x => x.Value).ToList();
foreach (var t in result)
    Console.WriteLine(t);

See this Python demo:

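import regex  # third-party PyPI "regex" module; the built-in re has no captures()

s = 'hello: "\n  -foo=value1\n  -bar=value2\n  -baz=\\"value3\\"\n"'
rx = r'hello:\s*"(?:\s*-((?:\\.|[^\n"\\])+))+\s*"'
match = regex.search(rx, s)
if match:
    # captures(1) returns every value Group 1 matched, not just the last one
    for value in match.captures(1):
        print(value)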

Output:

foo=value1
bar=value2
baz=\"value3\"

This approach won't work in Go, because its regexp package keeps only the last value of a repeated capturing group. There, you need to extract the whole match with hello:\s*"([^"\\]*(?:\\[\s\S][^"\\]*)*)" and then split the Group 1 value on line breaks.
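The same two-step idea (extract the block, then split it) can be sketched with Python's built-in re module, which also keeps only the last repetition of a group. The helper name extract_values and the sample string are hypothetical, and the sketch assumes each entry sits on its own line.

```python
import re

# Step 1 pattern: the whole quoted block after "hello:", allowing escaped characters.
BLOCK_RE = re.compile(r'hello:\s*"([^"\\]*(?:\\[\s\S][^"\\]*)*)"')

def extract_values(text):
    """Return every "-key=value" entry (without the leading dash) in hello blocks."""
    values = []
    for block in BLOCK_RE.finditer(text):
        # Step 2: split the block body on line breaks and keep the dash entries.
        for line in block.group(1).splitlines():
            line = line.strip()
            if line.startswith("-"):
                values.append(line[1:])
    return values

# Sample input, assuming literal backslashes before the inner quotes.
sample = 'hello: "\n  -foo=value1\n  -bar=value2\n  -baz=\\"value3\\"\n"'
for value in extract_values(sample):
    print(value)
```

Splitting after the match keeps the regex simple and avoids relying on engine-specific capture collections, at the cost of a second pass over each block.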

Upvotes: 1
