Reputation: 343
Let's say I have this text:
abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
I want to capture these personal names:
Mark Jones, Taylor Daniel Lautner, Allan Stewart Konigsberg Farrow.
Basically, when we find (P followed by any capital letter, we capture the n previous words that start with a capital letter.
What I have achieved so far is to capture just one previous word with this code: \w+(?=\s+(\(P+[A-Z])). But I couldn't get any further than that.
I would appreciate it if someone could help :)
Upvotes: 2
Views: 96
Reputation: 71689
Regex pattern
\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]
To find all matches of the above pattern, we can use re.findall:
import re
text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""
matches = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', text)
print(matches)
# ['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
Regex details
\b : Word boundary to prevent partial matches
((?:[A-Z]\w+\s?)+) : First capturing group
  (?:[A-Z]\w+\s?)+ : Non-capturing group, matched one or more times
    [A-Z] : Matches a single capital letter from A to Z
    \w+ : Matches any word character one or more times
    \s? : Matches any whitespace character zero or one times
\s : Matches a single whitespace character
\( : Matches the character ( literally
P : Matches the character P literally
[A-Z] : Matches a single capital letter from A to Z
See the online regex demo
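As a minimal sketch, the breakdown above can also be written directly into the code with re.VERBOSE, so that each piece of the pattern carries its own comment. The name NAME_BEFORE_TAG is only an illustrative choice and is not part of the original answer; the pattern itself is unchanged.

```python
import re

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""

# Same pattern as above, spread over several lines with re.VERBOSE.
# NAME_BEFORE_TAG is a hypothetical name chosen for this sketch.
NAME_BEFORE_TAG = re.compile(
    r"""
    \b                      # word boundary, so a match never starts mid-word
    (                       # group 1: the whole name
        (?:[A-Z]\w+\s?)+    #   one or more capitalized words, each optionally followed by one whitespace
    )
    \s                      # one whitespace between the name and the parenthesis
    \(P[A-Z]                # opening parenthesis, P, then a capital letter
    """,
    re.VERBOSE,
)

# finditer gives match objects; group(1) is the captured name.
names = [m.group(1) for m in NAME_BEFORE_TAG.finditer(text)]
print(names)
# ['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
```

One thing to keep in mind: because [A-Z]\w+ needs at least one word character after the capital letter, a single-letter word such as a middle initial would break the run of names, and only the words after it would be captured.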
Upvotes: 3
Reputation: 133428
With your shown samples, please try the following, which uses Python's re library. First, findall with the pattern (.*?)\s+\((?=P[A-Z]) captures everything that comes before an opening parenthesis followed by P and a capital letter, and the results are stored in a list lst. Then re.sub removes the first run of non-whitespace characters, plus the whitespace after it, from each item in lst, which strips the leading non-name word and leaves only the name. Note that this relies on exactly one word preceding the name on each line, as in your samples.
import re
var="""abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""
lst = re.findall(r'(.*?)\s+\((?=P[A-Z])', var)
print([re.sub(r'^\S+\s+', '', s) for s in lst])
Output will be as follows:
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
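If the number of words before the name can vary, the two steps could possibly be folded into a single pattern. The sketch below is only one way to do it and is not part of the answer above: one_pass is a hypothetical pattern that matches a run of capitalized words and keeps it only when a lookahead sees "(P" plus a capital letter, and two_pass is just the answer's code wrapped in a function for comparison.

```python
import re

var = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""

# Hypothetical one-pass pattern (an assumption of this sketch): a run of
# capitalized words, accepted only if "(P" + capital letter follows.
one_pass = re.compile(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*(?=\s+\(P[A-Z])')

def two_pass(s):
    # The approach from the answer: capture the text before "(P" + capital,
    # then drop the first word of each captured item.
    lst = re.findall(r'(.*?)\s+\((?=P[A-Z])', s)
    return [re.sub(r'^\S+\s+', '', item) for item in lst]

# On the shown samples both approaches give the same three names.
print(one_pass.findall(var))
print(two_pass(var))

# With two leading words, only the first one is stripped by two_pass.
extra = "foo bar Mark Jones (PP)"
print(two_pass(extra))          # ['bar Mark Jones']
print(one_pass.findall(extra))  # ['Mark Jones']
```

The trade-off is that one_pass depends on every name word starting with a capital letter followed by at least one more word character, while the two-step version depends on the position of the name in the line.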
Upvotes: 3