Reputation: 81
I'm trying to capture multiple lines that sit between occurrences of a special keyword and are separated by newlines.
text = """
KeyWord some text
Data: 012
***coconut***
list[123]
par(098)
Finish me
KeyWord random random text
Data: 1257
Cowboy
***mango***
list[121343]
par(afsd)
Catwoman
Tamarindo
Gotic
Gotham
KeyWord another text
Data: 532
***banana***
It can have more lines
And more
And more
list[dhf]
par(345)
"""
As you can see, every 'paragraph' starts with KeyWord and has a different number of lines. I want to grab each paragraph (they are separated by n blank lines) and put them into a list, so that I can iterate over it later. The list should contain only the paragraphs, meaning lines with characters and no blank lines, so its final length should be 3.
I tried the following with no success:
pattern = re.compile(r'KeyWord .+KeyWord',re.DOTALL)
Upvotes: 1
Views: 196
Reputation: 163632
You could get the matches without using re.DOTALL, to prevent unnecessary backtracking.
If KeyWord is always at the start of a line, you could use the anchor ^ together with re.MULTILINE:
^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*
Explanation
^KeyWord\b
Start of line, match KeyWord followed by a word boundary
.*
Match any char except a newline, 0+ times
(?:
Non-capturing group
\r?\n
Match a newline
(?!KeyWord\b).*
Assert that what is directly to the right is not KeyWord, then match the whole line
)*
Close the group and repeat it 0+ times
Example code
result = re.findall(r"^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*", text, re.MULTILINE)
print(result)
print(len(result))
Output
['KeyWord some text\nData: 012\n***coconut***\nlist[123]\npar(098)\nFinish me\n\n', 'KeyWord random random text\nData: 1257\nCowboy\n***mango***\nlist[121343]\npar(afsd)\nCatwoman\nTamarindo\nGotic\nGotham\n\n\n\n', 'KeyWord another text\nData: 532\n***banana***\nIt can have more lines\nAnd more\nAnd more\nlist[dhf]\npar(345)\n\n\n']
3
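Note that the matches above still carry the trailing blank lines. If the list should hold only lines with characters, as the question asks, one possible follow-up is to split each match into lines and drop the empty ones. This is only a sketch: the shortened sample text and the names pattern and paragraphs are my own, and the blank lines in the sample are an assumption based on the question's description.

```python
import re

# Shortened sample input (an assumption, not the asker's exact data):
# paragraphs start with KeyWord and are separated by one or more blank lines.
text = """
KeyWord some text
Data: 012

KeyWord random random text
Data: 1257
Cowboy


KeyWord another text
Data: 532
"""

# Same pattern as in the answer above.
pattern = re.compile(r"^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*", re.MULTILINE)

# For every match, keep only the lines that contain non-whitespace characters.
paragraphs = [
    [line for line in block.splitlines() if line.strip()]
    for block in pattern.findall(text)
]

for paragraph in paragraphs:
    print(paragraph)
print(len(paragraphs))
```

Each paragraph is then a list of its non-empty lines. If a single string per paragraph is preferred, "\n".join(paragraph) would rebuild it.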
Upvotes: 0
Reputation: 522752
I would use re.findall here:
paragraphs = re.findall(r'\bKeyWord(.*?)(?=\bKeyWord\b|$)', text, flags=re.DOTALL)
print(paragraphs)
This prints:
[' some text\nData: 012\n***coconut***\nlist[123]\npar(098)\nFinish me\n\n\n',
' random random text\nData: 1257\nCowboy\n***mango***\nlist[121343]\npar(afsd)\nCatwoman\nTamarindo\nGotic\nGotham\n\n\n\n\n',
' another text\nData: 532\n***banana***\nIt can have more lines\nAnd more\nAnd more\nlist[dhf]\npar(345)\n\n']
The regex logic here is to capture what follows the keyword up to, but not including, either the next keyword or the end of the input.
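Because KeyWord sits outside the capture group, it is missing from each result, and the trailing newlines are kept. A possible variation, if you want the keyword included and the surrounding whitespace removed, is to drop the capture group and strip each match. This is a sketch under my own assumptions: the tiny sample text is made up, and \Z is used as an explicit end-of-input anchor instead of $.

```python
import re

# Made-up sample input with a varying number of blank lines between paragraphs.
text = "KeyWord a\nx\n\n\nKeyWord b\ny\n\nKeyWord c\nz\n"

# Lazy match from one keyword up to (not including) the next keyword
# or the end of the input; the keyword itself is part of the match.
pattern = re.compile(r"\bKeyWord\b.*?(?=\bKeyWord\b|\Z)", re.DOTALL)

# strip() removes the blank lines left at the end of each paragraph.
paragraphs = [m.group(0).strip() for m in pattern.finditer(text)]

print(paragraphs)
print(len(paragraphs))
```

Keep in mind that strip() only removes whitespace at the two ends of a paragraph. Blank lines in the middle of a paragraph would survive and would need a per-line filter.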
Upvotes: 1