Reputation: 23
I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
but it matches "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush", which is in the middle of the text.
Basically I want it to match only the first two capitalized words at the beginning of a line without touching the rest. I am no regex expert and I am not sure how to adapt it.
This is the text:
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Thanks in advance.
Upvotes: 0
Views: 127
Reputation: 11
You can use this pattern /^([A-Z]+.*? ){2}/m
if you are certain that each name consists of exactly two capitalised terms and that they are always the first two terms on the line. Working example on regex101.com
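As a rough check of how this pattern behaves, here is a minimal Python sketch. Python is an assumption, since this answer does not name a language; the /m flag corresponds to re.MULTILINE, and the function name strip_two_leading_words is only illustrative.

```python
import re

# The pattern from this answer; /m is expressed as re.MULTILINE.
TWO_LEADING_WORDS = re.compile(r"^([A-Z]+.*? ){2}", re.MULTILINE)

def strip_two_leading_words(text: str) -> str:
    """Remove two space-terminated capitalised terms at the start of each line."""
    return TWO_LEADING_WORDS.sub("", text)

sample = (
    "Remggrehte Sertrro\n"
    "Remggrehte Sertrro We did want a 4-day work week for years.\n"
)
print(strip_two_leading_words(sample))
```

Because each of the two terms must be followed by a space, a line that holds only the name with no trailing space is left as it is; only the name at the start of a longer line is removed.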
Upvotes: 1
Reputation: 163362
You don't need the positive lookahead to match the first 2 capitalized words.
In your pattern, this part (?=\s[A-Z])
can be omitted, as you first assert it and then directly match it.
You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S)
on the right:
^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)
Explanation
^
Start of string (with the multiline flag enabled, the start of each line)
[A-Z][a-z]+
Match a char A-Z and 1+ lowercase chars a-z
[^\S\r\n]
Match a whitespace char except a newline, as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
[A-Z][a-z]+
Match a char A-Z and 1+ lowercase chars a-z
(?!\S)
Assert a whitespace boundary on the right
Note that [A-Z][a-z]+ matches an uppercase char followed only by chars a-z. To match word characters you could use \w instead of [a-z].
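To see the pattern applied to the text from the question, here is a minimal Python sketch. It assumes re.MULTILINE so that ^ anchors at every line, and it assumes that lines left empty after the removal, as well as the space that followed a removed name, should be dropped. The function name remove_leading_names is only illustrative.

```python
import re

# Pattern from this answer; re.MULTILINE makes ^ match at the start of every line.
NAME_AT_LINE_START = re.compile(
    r"^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)", re.MULTILINE
)

def remove_leading_names(text: str) -> str:
    """Delete two capitalized words at the start of each line, then tidy up."""
    without_names = NAME_AT_LINE_START.sub("", text)
    # Assumption: surrounding whitespace and now-empty lines are unwanted.
    lines = (line.strip() for line in without_names.splitlines())
    return "\n".join(line for line in lines if line)

text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

print(remove_leading_names(text))
```

"Mash Mush" is not touched because it is not at the start of a line. Note that a sentence which itself begins with two capitalized words would lose them as well, so this only works if every line starts with a name.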
Upvotes: 0
Reputation: 22952
You can remove the lines that contain only a name using the re.MULTILINE
flag and the following regex: r"^[A-Z]\w+[ \t]+[A-Z]\w+[ \t]*$\n?"
. This regex matches a name only when it fills the whole line, with no extra text. The trailing \n? also consumes the line break, so no empty line is left behind.
Here is a demo:
import re
# Sample text from the question
text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""
# With re.MULTILINE, ^ and $ anchor at the start and end of every line
print(re.sub(r"^[A-Z]\w+[ \t]+[A-Z]\w+[ \t]*$\n?", "", text, flags=re.MULTILINE))
You get:
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Upvotes: 0