Reputation: 845
I have a number of news articles, some of which have intro and end statements. The possible combinations are...
What I would like to do is return "Some text about a news story." in each case. The regex below returns the 1st and 2nd examples, but I am struggling with the cases where there is only an intro or only an end statement.
re.search(r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)', str(doc)).group()
# "(?i)" to ignore case.
# "(?<=: )" to capture text after and excluding ": "
# ".*" match everything between the two patterns.
# "(?=Read more|Full story|\. Source)" match everything before these three strings.
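To make the failure concrete, here is a minimal sketch. Only the first sample string appears in this thread; the two shorter variants are assumed examples of a story with no intro or no end statement.

```python
import re

# Pattern from the question: it needs ": " before the story AND an end marker after it.
pattern = r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)'

samples = [
    "The BBC reports: Some text about a news story. Read more on BBC.com.",  # intro + end
    "Some text about a news story. Read more on BBC.com.",  # assumed example: no intro
    "The BBC reports: Some text about a news story.",       # assumed example: no end
]

# re.search returns None when nothing matches, so calling .group() directly
# on the result would raise AttributeError for the last two samples.
results = [re.search(pattern, s) for s in samples]
for s, m in zip(samples, results):
    print(repr(s), "->", repr(m.group()) if m else None)
```

Because both the lookbehind and the lookahead are mandatory, the pattern only matches when an intro and an end statement are both present.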
Upvotes: 1
Views: 156
Reputation: 627101
It seems you may use
import re
doc = "The BBC reports: Some text about a news story. Read more on BBC.com."
rx = r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)'
m = re.search(rx, doc)
if m:
    print(m.group(1))
See the regex demo.
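As a quick check, the pattern can be run against the four intro/end combinations. This is only a sketch: the first sample is the string used above, the other three are assumed variants of it, and `extract_story` is a hypothetical helper name. The `.strip()` call is an addition, since group 1 can keep the space that precedes "Read more" or "Full story".

```python
import re

rx = re.compile(r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)')

def extract_story(doc):
    """Hypothetical helper: return the story text, or None if nothing matches."""
    m = rx.search(doc)
    # Strip the whitespace that may be left before the end marker.
    return m.group(1).strip() if m else None

samples = [
    "The BBC reports: Some text about a news story. Read more on BBC.com.",  # intro + end
    "Some text about a news story. Read more on BBC.com.",  # assumed: end only
    "The BBC reports: Some text about a news story.",       # assumed: intro only
    "Some text about a news story.",                        # assumed: neither
]

stories = [extract_story(s) for s in samples]
for story in stories:
    print(story)
```

One caveat with the `\. Source` alternative: the period is part of the delimiter, so for an ending such as ". Source: BBC." the captured text stops before the final period.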
Details
(?i) - ignore case flag
(?:[^:\n]*:\s*|^) - a non-capturing group matching either 0+ chars other than : and a newline, followed with : and then 0+ whitespaces, or the start of string
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible
(?:$|Read more|Full story|\. Source) - a non-capturing group matching the end of string, Read more, Full story or . Source.
Upvotes: 1