Python regex to identify two consecutive capitalized words at the beginning of the line

I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it matches "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush", which is inside the text. Basically, I want it to match only the first two capitalized words at the beginning of a line without touching the rest. I am no regex expert and I am not sure how to adapt it.

This is the text:

Remggrehte Sertrro

Remggrehte Sertrro We did want a 4-day work week for years.

Perrhhfson Forrtdd

Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

Thanks in advance.

Upvotes: 0

Views: 127

Answers (3)

lukasDamaceno

Reputation: 11

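You can use the pattern /^([A-Z]+.*? ){2}/m if you are certain that each line of interest starts with exactly two terms whose first letters are capitalised. In Python, the /m flag corresponds to re.MULTILINE. Because each term must be followed by a space, a line that contains only the two names, with no trailing space, is not matched. A working example is on regex101.com.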
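
A minimal sketch of how this pattern could be applied in Python, assuming the sample text from the question and re.MULTILINE as the equivalent of the /m flag. The variable names are only illustrative.

```python
import re

# Sample text from the question.
text = (
    "Remggrehte Sertrro\n"
    "\n"
    "Remggrehte Sertrro We did want a 4-day work week for years.\n"
    "\n"
    "Perrhhfson Forrtdd\n"
    "\n"
    "Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , "
    "the economy Mash Mush will continue to.\n"
)

# re.MULTILINE plays the role of /m: ^ matches at the start of every line.
pattern = re.compile(r"^([A-Z]+.*? ){2}", re.MULTILINE)

# Each term must be followed by a space, so only the sentence lines change;
# the lines holding just the two names are left as they are.
stripped = pattern.sub("", text)
print(stripped)
```

With this text, the names are removed from the start of the two sentence lines, while "Mash Mush" in the middle of the last line is untouched.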

Upvotes: 1

The fourth bird

Reputation: 163362

You don't need the positive lookahead to match the first 2 capitalized words.

In your pattern, the part (?=\s[A-Z]) can be omitted, because you first assert it and then immediately match the same thing.
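
As a quick check of that claim, the following sketch compares the pattern from the question with and without the lookahead on the sample text. The variable names are illustrative, and the comparison only shows equality on this input.

```python
import re

# Sample text from the question.
text = (
    "Remggrehte Sertrro\n"
    "\n"
    "Remggrehte Sertrro We did want a 4-day work week for years.\n"
    "\n"
    "Perrhhfson Forrtdd\n"
    "\n"
    "Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , "
    "the economy Mash Mush will continue to.\n"
)

# Pattern from the question, and the same pattern with the lookahead removed.
with_lookahead = r"([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)"
without_lookahead = r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)"

found_with = re.findall(with_lookahead, text)
found_without = re.findall(without_lookahead, text)

# The group (?:\s[A-Z][a-z]+)+ must match at least once, which already
# implies what the lookahead asserts, so both lists are the same.
print(found_with == found_without)
print(found_without)
```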


You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) on the right:

^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)

Explanation

  • ^ Start of string (or the start of each line when re.MULTILINE is set)
  • [A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
  • [^\S\r\n] Match a whitespace char other than a newline; \s could also match a newline, and both capitalized words must be on the same line
  • [A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
  • (?!\S) Assert a whitespace boundary on the right
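
A minimal sketch of the pattern in Python, assuming the sample text from the question and re.MULTILINE so that ^ matches at every line start. The extra [^\S\r\n]* in the substitution is an addition for illustration only: it consumes the space left behind after a removed name.

```python
import re

# Sample text from the question.
text = (
    "Remggrehte Sertrro\n"
    "\n"
    "Remggrehte Sertrro We did want a 4-day work week for years.\n"
    "\n"
    "Perrhhfson Forrtdd\n"
    "\n"
    "Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , "
    "the economy Mash Mush will continue to.\n"
)

# Pattern from the answer; re.MULTILINE makes ^ match at each line start.
name_at_line_start = re.compile(
    r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE
)
matches = name_at_line_start.findall(text)
print(matches)

# Assumption (not part of the answer): also swallow trailing spaces or tabs
# so the remaining sentence does not start with a blank.
cleaned = re.sub(
    r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)[^\S\r\n]*",
    "",
    text,
    flags=re.MULTILINE,
)
print(cleaned)
```

The lines that held only a name become empty lines; whether to drop those as well is a separate choice.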


Note that [A-Z][a-z]+ matches only an uppercase A-Z followed by lowercase chars a-z. To match any word characters after the capital letter, you could use \w instead of [a-z].
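
To illustrate the difference, here is a small sketch with two made-up names that are not in the question's text. It assumes Python 3 str patterns, where \w is Unicode-aware by default; note that \w also accepts digits and underscores.

```python
import re

strict = re.compile(r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE)
relaxed = re.compile(r"^[A-Z]\w+[^\S\r\n][A-Z]\w+(?!\S)", re.MULTILINE)

# Hypothetical sample: an accented name and a name with an inner capital.
sample = "Jos\u00e9 P\u00e9rez said hello.\nMcDonald Smith agreed."

# [a-z] stops at the accented letter and at the inner capital,
# so the strict pattern finds nothing here.
print(strict.findall(sample))

# \w accepts both, so the relaxed pattern finds both names.
print(relaxed.findall(sample))
```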

Upvotes: 0

Laurent LAPORTE

Reputation: 22952

You can remove the lines that contain only the names by using the re.MULTILINE flag with the following regex: r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$". This regex matches a name only when it occupies the whole line, with no extra text after it.

Here is a demo:

import re

text = """\
Remggrehte Sertrro

Remggrehte Sertrro We did want a 4-day work week for years.

Perrhhfson Forrtdd

Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

print(re.sub(r"^(?:[A-Z]\w+\s+[A-Z]\w+\s+)$", "", text, flags=re.MULTILINE))

You get:


Remggrehte Sertrro We did want a 4-day work week for years.


Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

Upvotes: 0
