BenP

Reputation: 845

Regex findall between two optional patterns, return all if none

I have a number of news articles, some of which have intro and end statements. The possible combinations are...

What I would like to do is return "Some text about a news story." in each case. The regex below works for the 1st and 2nd examples, but I am struggling with the cases where the intro or the end statement is missing.

re.search(r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)', str(doc)).group()

# "(?i)" to ignore case.
# "(?<=: )" to capture text after and excluding ": "
# ".*" match everything between the two patterns. 
# "(?=Read more|Full story|\. Source)" match everything before these three strings. 
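
To make the failure concrete, here is a minimal reproduction. The sample strings and the helper name try_match are assumptions for illustration, not the actual articles. The lookbehind needs a ": " and the lookahead needs one of the end markers, so re.search finds nothing when either is missing, and calling .group() on that None result raises AttributeError.

```python
import re

# Same pattern as in the question.
PATTERN = r'(?i)(?<=: ).*(?=Read more|Full story|\. Source)'

def try_match(doc):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(PATTERN, doc)
    return m.group() if m else None

# Hypothetical inputs: both statements, intro only, end only, neither.
both = "The BBC reports: Some text about a news story. Read more on BBC.com."
intro_only = "The BBC reports: Some text about a news story."
end_only = "Some text about a news story. Read more on BBC.com."
neither = "Some text about a news story."

for doc in (both, intro_only, end_only, neither):
    print(repr(try_match(doc)))
```

Only the first input matches (with a trailing space before "Read more"); the other three give None.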

Upvotes: 1

Views: 156

Answers (1)

Wiktor Stribiżew

Reputation: 627101

It seems you may use

import re
doc = "The BBC reports: Some text about a news story. Read more on BBC.com."
rx = r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)'
m = re.search(rx, doc)
if m:
    print(m.group(1))

See the regex demo.

Details

  • (?i) - ignore case flag
  • (?:[^:\n]*:\s*|^) - a non-capturing group matching either 0+ chars other than : and a newline, followed by : and then 0+ whitespace chars, or the start of the string
  • (.*?) - Group 1: any 0+ chars other than line break chars as few as possible
  • (?:$|Read more|Full story|\. Source) - a non-capturing group matching the end of the string, Read more, Full story or . Source.
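
As a quick check of the breakdown above, the following sketch runs the same pattern over four sample strings, one per intro/end combination. The samples and the helper name extract_body are assumptions for illustration. Group 1 can end with a space when an end marker follows, so the helper strips it.

```python
import re

# Pattern from the answer above, compiled once for reuse.
RX = re.compile(r'(?i)(?:[^:\n]*:\s*|^)(.*?)(?:$|Read more|Full story|\. Source)')

def extract_body(doc):
    """Return the text between the optional intro and the optional end statement."""
    m = RX.search(doc)
    # Group 1 may carry a trailing space when an end marker follows, so strip it.
    return m.group(1).strip() if m else None

# Hypothetical inputs: both statements, intro only, end only, neither.
samples = [
    "The BBC reports: Some text about a news story. Read more on BBC.com.",
    "The BBC reports: Some text about a news story.",
    "Some text about a news story. Read more on BBC.com.",
    "Some text about a news story.",
]

for s in samples:
    print(repr(extract_body(s)))
```

Two caveats follow from the pattern itself: the \. Source alternative consumes the final period, so that sentence comes back without it, and an article with no intro but with a colon in its body will lose everything up to that first colon.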

Upvotes: 1
