Reputation: 3832
I would like to capture all [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.
So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z." I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).
It almost seems that I need a recursive regex to do this: [[A-Za-z].]+.
Is it possible to implement this with either Python's re module or the regex module?
Upvotes: 1
Views: 187
Reputation: 368954
Using positive look-around assertions:
>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']
UPDATE: As @bubblebobble suggested, the regex can be simplified using \S (non-space character) with negative look-around assertions:
pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
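As a quick check, the sketch below runs both patterns over the three sample strings used above and compares the results. The names ORIGINAL, SIMPLIFIED, samples and results are only for illustration.

```python
import re

# Pattern from the first version of this answer (positive look-arounds).
ORIGINAL = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
# Simplified version: not preceded and not followed by a non-space character.
SIMPLIFIED = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'

samples = [
    'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'DEF A.B.C. UVWX U.V.W.X.Y',
]

results = []
for text in samples:
    simplified = re.findall(SIMPLIFIED, text)
    # Record whether both patterns agree on this sample.
    results.append((simplified, simplified == re.findall(ORIGINAL, text)))
    print(simplified)
```

On these three inputs both patterns should return the same lists; other inputs (for example ones containing newlines or tabs) are not covered by this check.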
Upvotes: 1
Reputation: 12434
This will work for you, using simple re.findall notation:
(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+
In the regex, I first check whether we are at the start of the string or whether there is a whitespace character before the match, and then I check for a repeated letter+period. I place the parts I do not want to capture into a non-capturing group (?:...).
You can see it working here: https://regex101.com/r/ZwW7c7/4
Python Code (that I wrote):
import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))
Output:
['D.E.F.', 'A.B.C.', 'U.V.W.X.']
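Note that this pattern only checks what comes before the match, so a run that is immediately followed by another letter (e.g. U.V.W.X.Y) would still be picked up. If that is not wanted, one option is to append a negative lookahead. A minimal sketch, assuming such runs should be rejected; the (?!\S) suffix and the name strict_regex are additions for illustration and not part of the pattern above:

```python
import re

regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
# Assumption: same pattern plus a lookahead that requires whitespace
# or the end of the string right after the match.
strict_regex = regex + r"(?!\S)"

string = 'DEF A.B.C. UVWX U.V.W.X.Y'
loose = re.findall(regex, string)
strict = re.findall(strict_regex, string)
print(loose)   # ['A.B.C.', 'U.V.W.X.']
print(strict)  # ['A.B.C.']
```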
Upvotes: 1
Reputation: 25779
You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:
import re
MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."
# get a list of matches
print(MATCH_GROUPS.findall(your_string))  # ['A.B.C.', 'U.V.W.X.']
A bit clunky but should get the job done with edge cases as well.
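One edge case worth checking: because the boundary characters are consumed as part of each match, two groups separated by a single space share that space, and the second one appears to be skipped. The sketch below shows the behaviour, with a lookaround-based variant as one possible workaround. LOOKAROUND and the sample string are hypothetical additions, not part of the code above.

```python
import re

MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
# Assumption: the same boundaries written as zero-width assertions,
# so the separating character is not consumed by the match.
LOOKAROUND = re.compile(r"(?<![a-z.])(?:[a-z]\.)+(?![a-z.])", re.IGNORECASE)

adjacent = "A.B.C. D.E.F."
consumed = MATCH_GROUPS.findall(adjacent)
zero_width = LOOKAROUND.findall(adjacent)
print(consumed)    # ['A.B.C.']
print(zero_width)  # ['A.B.C.', 'D.E.F.']
```

On the original example string both patterns give the same result, since the matches there are separated by other words.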
P.S. The above will match single occurrences as well (e.g. A. if it appears standalone). If you're looking for multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).
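To illustrate the difference between the two quantifiers, here is a small sketch on a made-up input that contains a standalone "A."; the names ONE_OR_MORE, TWO_OR_MORE and sample are only for illustration:

```python
import re

# Same pattern with "+" (one or more letter+dot pairs).
ONE_OR_MORE = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
# Same pattern with "{2,}" (at least two letter+dot pairs).
TWO_OR_MORE = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.){2,})(?:[^a-z.]|$)", re.IGNORECASE)

sample = "A. ABC A.B.C. XYZ"  # hypothetical input
with_plus = ONE_OR_MORE.findall(sample)
with_range = TWO_OR_MORE.findall(sample)
print(with_plus)   # ['A.', 'A.B.C.']
print(with_range)  # ['A.B.C.']
```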
edit: A small change to match beginning/end of string boundaries as well.
Upvotes: 1
Reputation: 49
This regex seems to do the job (testing whether we are at the beginning of the string or after a space):
\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+
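Note that when a pattern with capturing groups is passed to re.findall, the result contains the group contents (only the last repetition of each group) rather than the whole match. A minimal sketch of one way around that, using re.finditer and group(0); the sample string is borrowed from another answer in this thread and the variable names are illustrative:

```python
import re

pattern = r"\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'

# findall returns one tuple per match, holding the last repetition of each group.
groups_only = re.findall(pattern, string)
# group(0) is the full text of each match.
full_matches = [m.group(0) for m in re.finditer(pattern, string)]

print(groups_only)   # [('F.', ''), ('', 'C.'), ('', 'X.')]
print(full_matches)  # ['D.E.F.', 'A.B.C.', 'U.V.W.X.']
```

Turning both groups into non-capturing ones, (?:...), should have the same effect with re.findall.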
EDIT: Sorry Shawn, I didn't see your modified answer.
Upvotes: 0