Reputation: 8177
Is there a way, using regexes, to match a string that is arbitrarily split over multiple lines? Say we have the following format in a file:
msgid "This is "
"an example string"
msgstr "..."
msgid "This is an example string"
msgstr "..."
msgid ""
"This is an "
"example"
" string"
msgstr "..."
msgid "This is "
"an unmatching string"
msgstr "..."
So we would like a pattern that matches all the example strings, i.e. one that matches the string regardless of how it is split across lines. Note that we are after a specific string, as shown in the sample, not just any string. So in this case we would like to match the string "This is an example string".
Of course we can easily concatenate the strings and then apply the match, but it got me wondering whether this is possible with a regex alone. I'm talking about Python regexes, but a general answer is OK.
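For comparison, the concatenate-then-match approach mentioned above could be sketched roughly as follows. The helper name `iter_msgids` and the `sample` variable are made up for illustration, and the sketch assumes the quoted fragments never contain escaped quotes, which may not hold for real files.

```python
import re

sample = '''msgid "This is "
"an example string"
msgstr "..."
msgid "This is an example string"
msgstr "..."
msgid ""
"This is an "
"example"
" string"
msgstr "..."
msgid "This is "
"an unmatching string"
msgstr "..."
'''

# One msgid entry: the keyword followed by one or more quoted fragments.
ENTRY = re.compile(r'^msgid((?:\s*"[^"\n]*")+)', re.MULTILINE)
# A single quoted fragment (assumes no escaped quotes inside it).
FRAGMENT = re.compile(r'"([^"\n]*)"')

def iter_msgids(text):
    """Yield each msgid with its quoted fragments joined into one string."""
    for entry in ENTRY.finditer(text):
        yield "".join(FRAGMENT.findall(entry.group(1)))

target = "This is an example string"
hits = [s for s in iter_msgids(sample) if s == target]
print(len(hits))  # prints 3
```

This sidesteps the line-splitting problem entirely, at the cost of a second pass over each entry.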
Upvotes: 4
Views: 316
Reputation: 3326
This is a bit tricky with the need for quotes on every line, and the allowance of empty lines. Here's a regex that matches the file you posted correctly:
'(""\n)*"This(( "\n(""\n)*")|("\n(""\n)*" )| )is(( "\n(""\n)*")|("\n(""\n)*" )| )an(( "\n(""\n)*")|("\n(""\n)*" )| )example(( "\n(""\n)*")|("\n(""\n)*" )| )string"'
That looks confusing, but it is just the string you want to match, except that it starts with:
(""\n)*"
and replaces the space between each pair of words with:
(( "\n(""\n)*")|("\n(""\n)*" )| )
which allows three different possibilities after each word: either "space, quote, newline, (any number of empty strings), quote", or that same sequence but with the space moved to the end, or just a plain space.
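To see the three alternatives in isolation, here is a small sketch that applies the separator to just two words. The test strings in `cases` are invented for illustration; the separator itself is copied from the pattern above.

```python
import re

# Separator copied from the answer; "\n" here is a literal newline character.
SEP = '(( "\n(""\n)*")|("\n(""\n)*" )| )'
pattern = re.compile('"This' + SEP + 'is"')

cases = [
    '"This is"',            # plain space
    '"This "\n"is"',        # space before the line break
    '"This"\n" is"',        # space after the line break
    '"This "\n""\n"is"',    # empty string between the two lines
    '"This"\n"is"',         # no space at all: should not match
]
results = [pattern.fullmatch(c) is not None for c in cases]
print(results)  # prints [True, True, True, True, False]
```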
A much easier way to get this working would be to write a little function that would take in the string you are trying to match and return the regex that will match it:
def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'
So, if you had the file you posted in a string called "filestring", you would get the matches like this:
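import re
def getregex(string):
    # Replace every space with the "space or line break" alternation.
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'
matcher = re.compile(getregex("This is an example string"))
# filestring is assumed to already hold the contents of the file.
for i in matcher.finditer(filestring):
    print(i.group(0), "\n")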
>>> "This is "
"an example string"
"This is an example string"
""
"This is an "
"example"
" string"
This regex doesn't take into account the space you have after "example" in the third msgid, but I assume this is generated by a machine and that's a mistake.
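One caveat with getregex is that the search string is inserted into the pattern unescaped, so characters such as "." or "(" would be treated as regex syntax. A possible variant, sketched here under the hypothetical name `getregex_escaped`, escapes each word with `re.escape` before joining:

```python
import re

# Same separator as in getregex above.
SEP = '(( "\n(""\n)*")|("\n(""\n)*" )| )'

def getregex_escaped(string):
    # Escape each word separately so the separator itself stays unescaped.
    words = [re.escape(word) for word in string.split(" ")]
    return '(""\n)*"' + SEP.join(words) + '"'

matcher = re.compile(getregex_escaped("a.b c"))
print(matcher.fullmatch('"a.b "\n"c"') is not None)  # prints True
print(matcher.fullmatch('"axb c"') is not None)      # prints False
```

Splitting on spaces first keeps the behaviour independent of whether `re.escape` escapes the space character.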
Upvotes: 0
Reputation: 1325
Do you want to match a series of words? If so, you could look for the words with just whitespace (\s) in between, since \s matches newlines and spaces alike.
import re
search_for = "This is an example string"
search_for_re = r"\b" + r"\s+".join(search_for.split()) + r"\b"
pattern = re.compile(search_for_re)
match = lambda s: pattern.match(s) is not None
s = "This is an example string"
print(match(s), ":", repr(s))
s = "This is an \n example string"
print(match(s), ":", repr(s))
s = "This is \n an unmatching string"
print(match(s), ":", repr(s))
Prints:
True : 'This is an example string'
True : 'This is an \n example string'
False : 'This is \n an unmatching string'
Upvotes: 4