Jones

Reputation: 343

Capture the n previous words when matching a string

Let's say I have this text:

abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)

I want to capture these personal names:

Mark Jones, Taylor Daniel Lautner, Allan Stewart Konigsberg Farrow.

Basically, when we find (P followed by any capital letter, we capture the n previous words that start with a capital letter.

So far I have only managed to capture the single word right before the parenthesis, with this pattern: \w+(?=\s+(\(P+[A-Z])). I couldn't get any further than that. I'd appreciate it if someone could help :)
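To make the current behaviour concrete, here is a minimal sketch that runs the attempt above over the sample text. It assumes Python's re module (the question itself does not name a language), and the variable names are only illustrative.

```python
import re

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""

# The attempt from the question: one word, then a lookahead for "(P" + capital letter
attempt = r'\w+(?=\s+(\(P+[A-Z]))'

# group(0) is the text consumed by \w+; the lookahead itself consumes nothing,
# so only the last word of each name ends up in the match
last_words = [m.group(0) for m in re.finditer(attempt, text)]
print(last_words)  # ['Jones', 'Lautner', 'Farrow']
```

Since \w+ cannot cross a space, the match can never extend to the earlier words of the name.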

Upvotes: 2

Views: 96

Answers (2)

Shubham Sharma

Reputation: 71689

Regex pattern

\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]

To find all occurrences of the pattern above, we can use re.findall:

import re

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""

matches = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', text)

>>> matches
['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']

Regex details

  • \b : Word boundary to prevent partial matches
  • ((?:[A-Z]\w+\s?)+): First Capturing group
    • (?:[A-Z]\w+\s?)+: Non capturing group matches one or more times
      • [A-Z]: Matches a single uppercase letter from A to Z
      • \w+: Matches any word character one or more times
      • \s? : Matches any whitespace character zero or one times
  • \s : Matches a single whitespace character
  • \(: Matches the character ( literally
  • P : Matches the character P literally
  • [A-Z] : Matches a single uppercase letter from A to Z
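The breakdown above can also be written directly into the pattern with re.VERBOSE, which lets each piece carry its own comment. This is only a sketch of the same regex; the name NAME_BEFORE_P_TAG is made up for illustration.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece is commented.
NAME_BEFORE_P_TAG = re.compile(
    r"""
    \b                    # word boundary, so a name cannot start mid-word
    (                     # capturing group: the whole name
        (?:[A-Z]\w+\s?)+  # one or more capitalized words, each with an optional trailing whitespace
    )
    \s                    # a single whitespace before the parenthesis
    \(P[A-Z]              # a literal opening parenthesis, P, then one uppercase letter
    """,
    re.VERBOSE,
)

text = """abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)
"""

# With exactly one capturing group, findall returns that group's text per match
names = NAME_BEFORE_P_TAG.findall(text)
print(names)  # ['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
```

Note that [A-Z]\w+ needs at least two characters per word, so a single-letter initial such as "J" would not count as part of a name with this pattern.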


Upvotes: 3

RavinderSingh13

Reputation: 133428

With your shown samples, you could try the following, which uses Python's re library. First, findall with (.*?)\s+\((?=P[A-Z]) captures everything on each line that comes before a ( followed by P and a capital letter, and the results are stored in the list lst. Then re.sub replaces the first run of non-whitespace characters, together with the whitespace after it, with an empty string in each item, which leaves just the names.

import re
var="""abcdefg Mark Jones (PP) etc etc
akslaskAS Taylor Daniel Lautner (PMB) blabla
etcetc Allan Stewart Konigsberg Farrow (PRTW)"""

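# Capture everything on a line up to the whitespace before "(P" + capital letter
lst = re.findall(r'(.*?)\s+\((?=P[A-Z])', var)
# Drop the first (non-name) word from each captured chunk
print([re.sub(r'^\S+\s+', '', s) for s in lst])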

Output will be as follows:

['Mark Jones', 'Taylor Daniel Lautner', 'Allan Stewart Konigsberg Farrow']
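One thing worth checking with this two-step approach is that it removes exactly one leading word, so it fits the shown samples, where a single non-name word precedes each name. A small sketch with a made-up input line (sample is hypothetical, not from the question) compares it with the single-pattern answer above:

```python
import re

# Hypothetical input: two non-name words precede the name
sample = "foo bar Mark Jones (PP) etc"

# Two-step approach: capture the prefix, then drop its first word
chunks = re.findall(r'(.*?)\s+\((?=P[A-Z])', sample)
two_step = [re.sub(r'^\S+\s+', '', s) for s in chunks]

# Single-pattern approach that anchors on capitalized words
single = re.findall(r'\b((?:[A-Z]\w+\s?)+)\s\(P[A-Z]', sample)

print(two_step)  # ['bar Mark Jones']
print(single)    # ['Mark Jones']
```

If the real data can have more than one word before the name, the capitalized-word pattern is probably the safer choice.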

Upvotes: 3
