camagu4

Reputation: 81

Capturing text inside keywords using regular expression

I'm trying to capture multi-line blocks of text that each start with a special keyword and are separated from one another by blank lines.

text = """
KeyWord some text
Data: 012
***coconut***
list[123]
par(098)
Finish me


KeyWord random random text
Data: 1257
Cowboy
***mango***
list[121343]
par(afsd)
Catwoman
Tamarindo
Gotic
Gotham




KeyWord another text
Data: 532
***banana***
It can have more lines
And more
And more
list[dhf]
par(345)


"""

As you can see, every 'paragraph' starts with KeyWord and has a different number of lines. I want to grab each paragraph (they are separated by n blank lines) and put them into a list that I can iterate over later. The list should contain only the paragraphs, that is, lines with characters and no blank lines, so its final length should be 3.

I tried the following with no success:

pattern = re.compile(r'KeyWord .+KeyWord',re.DOTALL)

Upvotes: 1

Views: 196

Answers (2)

The fourth bird

Reputation: 163632

You can get the matches without re.DOTALL, which avoids unnecessary backtracking.

If KeyWord is always at the start of a line, you can use the anchor ^ together with re.MULTILINE:

^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*

Explanation

  • ^KeyWord\b Start of line, match KeyWord followed by a word boundary
  • .* Match any char except a newline, 0+ times
  • (?: Non-capturing group
    • \r?\n Match a newline
    • (?!KeyWord\b).* Assert that what is directly to the right is not KeyWord, then match the whole line
  • )* Close the group and repeat it 0+ times
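
As an aside, the same pattern can be written with re.VERBOSE so that each part listed above carries its own comment. This is only a sketch: the name PARAGRAPH and the short sample string are made up for illustration and are not part of the question.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each piece is commented.
# PARAGRAPH is a hypothetical name chosen for this sketch.
PARAGRAPH = re.compile(
    r"""
    ^KeyWord\b          # start of a line, the literal KeyWord, then a word boundary
    .*                  # the rest of that first line
    (?:                 # non-capturing group, one iteration per following line
        \r?\n           #   a newline, optionally preceded by a carriage return
        (?!KeyWord\b)   #   the next line must not start with KeyWord
        .*              #   consume that whole line (it may be empty)
    )*                  # repeat for as many lines as qualify
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = "KeyWord a\nx\n\nKeyWord b\ny\n"  # hypothetical input
blocks = PARAGRAPH.findall(sample)
print(blocks)
```

Because empty lines also pass the negative lookahead, the blank separator lines end up at the tail of each match.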


Example code

import re

# text is the multi-line string from the question
result = re.findall(r"^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*", text, re.MULTILINE)
print(result)
print(len(result))

Output

['KeyWord some text\nData: 012\n***coconut***\nlist[123]\npar(098)\nFinish me\n\n', 'KeyWord random random text\nData: 1257\nCowboy\n***mango***\nlist[121343]\npar(afsd)\nCatwoman\nTamarindo\nGotic\nGotham\n\n\n\n', 'KeyWord another text\nData: 532\n***banana***\nIt can have more lines\nAnd more\nAnd more\nlist[dhf]\npar(345)\n\n\n']
3
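
Each match above still ends with the blank separator lines, while the question asks for a list without blank lines. One possible follow-up step, assuming the blank lines only ever appear between paragraphs and never inside one, is to strip trailing whitespace from every match. The shortened text below is a stand-in for the string in the question.

```python
import re

# Hypothetical shortened input; the full text from the question behaves the same way.
text = "\nKeyWord some text\nData: 012\n\n\nKeyWord another text\nData: 532\n\n"

matches = re.findall(r"^KeyWord\b.*(?:\r?\n(?!KeyWord\b).*)*", text, re.MULTILINE)

# Remove the trailing newlines so only lines with characters remain.
paragraphs = [m.rstrip() for m in matches]
print(paragraphs)
print(len(paragraphs))
```

If blank lines could also occur inside a paragraph, filtering m.splitlines() for non-empty lines would be needed instead of rstrip().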

Upvotes: 0

Tim Biegeleisen

Reputation: 522752

I would use re.findall here:

import re

# text is the multi-line string from the question
paragraphs = re.findall(r'\bKeyWord(.*?)(?=\bKeyWord\b|$)', text, flags=re.DOTALL)
print(paragraphs)

This prints:

[' some text\nData: 012\n***coconut***\nlist[123]\npar(098)\nFinish me\n\n\n',
 ' random random text\nData: 1257\nCowboy\n***mango***\nlist[121343]\npar(afsd)\nCatwoman\nTamarindo\nGotic\nGotham\n\n\n\n\n',
 ' another text\nData: 532\n***banana***\nIt can have more lines\nAnd more\nAnd more\nlist[dhf]\npar(345)\n\n']

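The regex logic here is to capture what follows the keyword up to, but not including, either the next occurrence of the keyword or the end of the input. Note that the keyword itself is outside the capture group, so it is not part of the returned strings.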
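
If the keyword should stay in each result and the surrounding blank lines should go, a variant along these lines might work. It is a sketch, not a drop-in replacement: it drops the capture group so findall returns whole matches, uses \Z for the true end of the input, and calls strip() on each match. The shortened text is a made-up stand-in for the string in the question.

```python
import re

# Hypothetical shortened input standing in for the text from the question.
text = "\nKeyWord some text\nData: 012\n\n\nKeyWord another text\nData: 532\n\n"

# No capture group, so findall returns the whole match, keyword included.
raw = re.findall(r"\bKeyWord\b.*?(?=\bKeyWord\b|\Z)", text, flags=re.DOTALL)

# Strip the blank lines left between paragraphs.
paragraphs = [block.strip() for block in raw]
print(paragraphs)
```

Since this pattern is not anchored to the start of a line, it also splits wherever KeyWord appears as a whole word inside a paragraph, so it assumes the keyword only occurs at the start of each block.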

Upvotes: 1
