Basel Shishani

Reputation: 8177

Matching a string that's arbitrarily split over multiple lines

Is there a way, using regexes, to match a string that is arbitrarily split over multiple lines? Say we have the following format in a file:

msgid "This is "
"an example string"
msgstr "..."

msgid "This is an example string"
msgstr "..."

msgid ""
"This is an " 
"example" 
" string"
msgstr "..."

msgid "This is " 
"an unmatching string" 
msgstr "..."

So we would like a pattern that matches all the example strings, i.e. matches the string regardless of how it is split across lines. Notice that we are after a specific string as shown in the sample, not just any string. So in this case we would like to match the string "This is an example string".

Of course we can easily concatenate the strings and then apply the match, but it got me wondering whether this is possible with a single regex. I'm asking about Python regexes, but a general answer is OK.
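For reference, the concatenate-then-match approach mentioned above might look roughly like the sketch below. The helper name `joined_msgids` is hypothetical, and the sketch assumes the quoted pieces contain no escaped quotes.

```python
import re

def joined_msgids(text):
    """Return each msgid's quoted pieces joined into a single string.

    Hypothetical helper; assumes no escaped quotes inside the pieces.
    """
    results = []
    # A msgid is the keyword followed by one or more quoted pieces,
    # separated by arbitrary whitespace (including newlines).
    for m in re.finditer(r'msgid((?:\s*"[^"\n]*")+)', text):
        pieces = re.findall(r'"([^"\n]*)"', m.group(1))
        results.append("".join(pieces))
    return results

sample = 'msgid "This is "\n"an example string"\nmsgstr "..."\n'
found = "This is an example string" in joined_msgids(sample)
```

After joining, an ordinary equality or substring test is enough, so no multi-line pattern is needed.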

Upvotes: 4

Views: 316

Answers (2)

Josiah

Reputation: 3326

This is a bit tricky because every line needs its own quotes and empty strings ("") are allowed. Here's a regex that matches the file you posted correctly:

'(""\n)*"This(( "\n(""\n)*")|("\n(""\n)*" )| )is(( "\n(""\n)*")|("\n(""\n)*" )| )an(( "\n(""\n)*")|("\n(""\n)*" )| )example(( "\n(""\n)*")|("\n(""\n)*" )| )string"'

That looks confusing, but it is just the string you want to match, prefixed with:

(""\n)*"

and with the space between each pair of words replaced by:

(( "\n(""\n)*")|("\n(""\n)*" )| )

which checks for three possibilities after each word: a space, quote, newline, any number of empty strings, then a quote; that same sequence with the space moved to the start of the next line; or just a space.

A much easier way to get this working would be to write a little function that would take in the string you are trying to match and return the regex that will match it:

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

So, if you had the file you posted in a string called "filestring", you would get the matches like this:

import re

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

matcher = re.compile(getregex("This is an example string"))

for i in matcher.finditer(filestring):
    print(i.group(0), "\n")

>>> "This is "
    "an example string"

    "This is an example string"

    ""
    "This is an "
    "example"
    " string"

This regex doesn't take into account the space you have after "example" in the third msgid, but I assume this is generated by a machine and that's a mistake.
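If the trailing whitespace is not a mistake, a variant along the following lines could tolerate it. This is only a sketch: `getregex_tolerant`, `BREAK` and `SEP` are names made up here, the words are passed through `re.escape` so regex metacharacters in the target are treated literally, and splits are still only recognised at spaces, not in the middle of a word.

```python
import re

# A line break inside the string: closing quote, optional trailing
# spaces/tabs, newline, any number of empty "" lines, opening quote.
BREAK = r'"[ \t]*\n(?:""[ \t]*\n)*"'

# Between two words: the space before the break, the space after it,
# or a plain space with no break at all.
SEP = r'(?: ' + BREAK + r'|' + BREAK + r' | )'

def getregex_tolerant(string):
    # Escape each word so characters like "." or "(" match literally.
    words = [re.escape(w) for w in string.split(" ")]
    return r'(?:""[ \t]*\n)*"' + SEP.join(words) + '"'

matcher = re.compile(getregex_tolerant("This is an example string"))
```

It is used the same way as above, e.g. with `matcher.finditer(filestring)`.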

Upvotes: 0

pwuertz

Reputation: 1325

Do you want to match a series of words? If so, you could look for the words with only whitespace (\s) in between, since \s matches newlines and spaces alike.

import re

search_for = "This is an example string"
search_for_re = r"\b" + r"\s+".join(search_for.split()) + r"\b"
pattern = re.compile(search_for_re)
match = lambda s: pattern.match(s) is not None

s = "This is an example string"
print(match(s), ":", repr(s))

s = "This is an \n example string"
print(match(s), ":", repr(s))

s = "This is \n an unmatching string"
print(match(s), ":", repr(s))

Prints:

True : 'This is an example string'
True : 'This is an \n example string'
False : 'This is \n an unmatching string'
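The test strings above contain no quotes, whereas the file in the question closes and reopens a quote at every line break. The same join idea can be applied per character instead of per word, allowing an optional quote-newline-quote break at every position. The following is a rough sketch under that assumption; `split_anywhere_regex` and `BREAK` are hypothetical names, and escaped quotes inside the string are not handled.

```python
import re

# Zero or more line breaks: closing quote, optional spaces/tabs,
# newline, optional spaces/tabs, opening quote. Repeating it also
# covers empty "" lines.
BREAK = r'(?:"[ \t]*\n[ \t]*")*'

def split_anywhere_regex(target):
    # Allow a break before, after, and between all characters.
    chars = [re.escape(c) for c in target]
    return '"' + BREAK + BREAK.join(chars) + BREAK + '"'

pattern = re.compile(split_anywhere_regex("This is an example string"))
text = 'msgid "This is an exam"\n"ple string"\nmsgstr "..."'
hit = pattern.search(text) is not None
```

Because a break is allowed between any two characters, this also matches strings split in the middle of a word, at the cost of a longer pattern.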

Upvotes: 4
