killajoule

Reputation: 3832

Recursive regex in python regex module?

I would like to capture all matches of [[A-Za-z].]+ in my string, that is, all repeats of an alphabetic character followed by a dot.

So for example, in "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

I would like to pull out "A.B.C." and "U.V.W.X." only (as they are repeats of one character followed by a dot).

It almost seems that I need a recursive regex to do this: [[A-Za-z].]+.

Is it possible to implement this with either python's re module or regex module?

Upvotes: 1

Views: 187

Answers (4)

falsetru

Reputation: 368954

Using positive look-around assertions:

>>> import re
>>> pattern = r'(?:(?<=\s)|^)(?:[A-Za-z]\.)+(?:(?=\s)|$)'
>>> re.findall(pattern, 'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'A.B.C. UVWX U.V.W.X. XYZ XY.Z.')
['A.B.C.', 'U.V.W.X.']
>>> re.findall(pattern, 'DEF A.B.C. UVWX U.V.W.X.Y')
['A.B.C.']

UPDATE: As @bubblebobble suggested, the regex could be simplified using \S (non-space character) with negative look-around assertions:

pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'
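
As a quick check, the simplified pattern can be run against the same three inputs used earlier in this answer. This is only a minimal sketch; the names `samples` and `results` are made up for illustration.

```python
import re

# Simplified pattern from the update: no non-space character on either side.
pattern = r'(?<!\S)(?:[A-Za-z]\.)+(?!\S)'

# The three test strings from the interactive session above.
samples = [
    'ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'A.B.C. UVWX U.V.W.X. XYZ XY.Z.',
    'DEF A.B.C. UVWX U.V.W.X.Y',
]

results = [re.findall(pattern, s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), '->', r)
```

The results should match the earlier session: two matches for each of the first two strings and only 'A.B.C.' for the third.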

Upvotes: 1

Shawn Tabrizi

Reputation: 12434

This will work for you, using simple re.findall notation:

(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+

In the regex, I first check whether we are at the start of the string or whether there is whitespace just before the match, and then I check for repeated letter+period pairs. The parts I do not want to capture go into a non-capturing group (?:...).

You can see it working here: https://regex101.com/r/ZwW7c7/4

Python Code (that I wrote):

import re
regex = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
string = 'D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z.'
print(re.findall(regex,string))

Output:

['D.E.F.', 'A.B.C.', 'U.V.W.X.']
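
One thing worth noting: the pattern above only checks the left boundary, so a run that is immediately followed by another letter still matches. The sketch below compares it with a variant that also checks the right side; the `(?!\S)` suffix and the variable names are assumptions added for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: checks the left boundary only.
left_only = r"(?:(?<=\s)|(?<=^))(?:[A-Za-z]\.)+"
# Assumed variant: also require whitespace or end of string after the match.
both_sides = left_only + r"(?!\S)"

text = 'DEF A.B.C. UVWX U.V.W.X.Y'

left_only_matches = re.findall(left_only, text)
both_sides_matches = re.findall(both_sides, text)

print(left_only_matches)   # ['A.B.C.', 'U.V.W.X.']
print(both_sides_matches)  # ['A.B.C.']
```

Whether 'U.V.W.X.' inside 'U.V.W.X.Y' should count as a match depends on what you need, so pick the variant accordingly.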

Upvotes: 1

zwer

Reputation: 25779

You can use a non-capturing group to define your match, then group its repeats nested between boundary characters (in this case anything that's not a letter or a dot) and capture all matched groups:


import re

MATCH_GROUPS = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)

your_string = "ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# get a list of matches
print(MATCH_GROUPS.findall(your_string))  # ['A.B.C.', 'U.V.W.X.']

A bit clunky but should get the job done with edge cases as well.

P.S. The above will match single occurrences as well (e.g. A. if it appears standalone). If you want multiple repeats only, replace the + (one or more repeats) with a range of your choice (e.g. {2,} for two or more repeats).

edit: A small change to match beginning/end of string boundaries as well.
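
A caveat, shown as a small sketch: because the boundary characters are consumed as part of each match, two runs separated by a single space cannot both match, since the space is already used up by the first one. The look-around version below is an assumed alternative and not part of the original answer; the names `consuming`, `lookaround` and `adjacent` are illustrative.

```python
import re

# Pattern from the answer above: the boundary characters are consumed.
consuming = re.compile(r"(?:[^a-z.]|^)((?:[a-z]\.)+)(?:[^a-z.]|$)", re.IGNORECASE)
# Assumed alternative: the same boundaries written as zero-width look-arounds.
lookaround = re.compile(r"(?<![a-z.])(?:[a-z]\.)+(?![a-z.])", re.IGNORECASE)

# Two runs separated by a single space.
adjacent = "A.B.C. U.V.W.X."

consumed_result = consuming.findall(adjacent)
lookaround_result = lookaround.findall(adjacent)

print(consumed_result)    # ['A.B.C.']
print(lookaround_result)  # ['A.B.C.', 'U.V.W.X.']
```

For the string in the question this makes no difference, because the matching runs there are separated by other words.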

Upvotes: 1

VdF

Reputation: 49

This regex seems to do the job (testing whether we are at the beginning of the string or after a space):

\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+

EDIT: Sorry Shawn, I didn't see your modified answer.
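
A usage note for the regex above, as a hedged sketch: it contains two capturing groups, so re.findall returns tuples holding only the last repetition of each group rather than the whole match. Using finditer with group(0) is one way around that (making the groups non-capturing would be another). The variable names here are illustrative.

```python
import re

pattern = re.compile(r"\A([A-Za-z]\.)+|(?<=\s)([A-Za-z]\.)+")
text = "D.E.F. ABC A.B.C. UVWX U.V.W.X. XYZ XY.Z."

# findall() returns one tuple per match, holding only the last
# repetition captured by each group ('' for a group that did not take part).
group_tuples = pattern.findall(text)

# finditer() with group(0) yields the full matched text instead.
full_matches = [m.group(0) for m in pattern.finditer(text)]

print(group_tuples)  # [('F.', ''), ('', 'C.'), ('', 'X.')]
print(full_matches)  # ['D.E.F.', 'A.B.C.', 'U.V.W.X.']
```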

Upvotes: 0
