KameeCoding

Reputation: 723

Python matching regex multiple times in a row (not the findall way)

This question is not asking about finding 'a' multiple times in a string etc.

What I would like to do is match:

[ a-zA-Z0-9]{1,3}\.

regexp multiple times in a row. One way of doing this is using | (alternation):

'[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.|[ a-zA-Z0-9]{1,3}\.[ a-zA-Z0-9]{1,3}\.'

So this matches the regexp 4, 3, or 2 times in a row. It matches things like:

a. v. b.
m a.b.

Is there any way to write this more compactly?

I tried doing

([ a-zA-Z0-9]{1,3}\.){2,4} 

but it does not behave the way I expected. This one matches:

regex.findall(string)
[u' b.', u'b.']

string is:

a. v. b. split them a.b. split somethinf words. THen we say some more words, like ten

Is there any way to do this? The goal is to match possible English abbreviations and names like Mary J. E., things that the sentence tokenizer treats as sentence-ending punctuation but that are not.

I want to match all of this:

U.S. , c.v.a.b. , a. v. p. 

Upvotes: 2

Views: 1169

Answers (1)

Kasravnd

Reputation: 107287

First of all, your regex does match the way you expect:

>>> s="aa2.jhf.jev.d23.llo."
>>> import re
>>> re.search(r'([ a-zA-Z0-9]{1,3}\.){2,4}',s).group(0)
'aa2.jhf.jev.d23.'
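
As far as I can tell, the surprising findall output in the question comes from how re.findall treats capture groups: when the pattern contains one group, it returns that group instead of the whole match, and a repeated group only keeps its last repetition. A minimal sketch of the difference, using a shortened version of the question's string (the variable names are mine):

```python
import re

text = 'a. v. b. split them a.b. split somethinf words.'
pattern = re.compile(r'([ a-zA-Z0-9]{1,3}\.){2,4}')

# findall returns the capture group, which holds only the last repetition
last_repetitions = pattern.findall(text)

# finditer yields match objects; group(0) is the full matched text
whole_matches = [m.group(0) for m in pattern.finditer(text)]

print(last_repetitions)  # [' b.', 'b.']
print(whole_matches)     # ['a. v. b.', 'm a.b.']
```

So the pattern itself finds the right spans; only the value that findall reports is different.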

But if you want findall to return whole substrings like U.S. , c.v.a.b. , a. v. p. you need to wrap the whole regex in a capture group:

>>> s = 'a. v. b. split them a.b. split somethinf words. THen we say some more'
>>> re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)
[('a. v. b.', ' b.'), ('m a.b.', 'b.')]

Then use a list comprehension to take the first element of each tuple, which is the full match:

>>> [i[0] for i in re.findall(r'(([ a-zA-Z0-9]{1,3}\.){2,4})',s)]
['a. v. b.', 'm a.b.']
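
An alternative worth considering (not part of the approach above, just a sketch) is to make the repeated group non-capturing with (?:...). With no capture groups in the pattern, findall returns the whole matches directly. The sample text is the one from the question; the strip() call is my own addition, because the character class includes a space and matches can start with one:

```python
import re

text = 'U.S. , c.v.a.b. , a. v. p. '

# (?:...) groups for repetition without capturing,
# so findall returns the full matched substrings
pattern = re.compile(r'(?:[ a-zA-Z0-9]{1,3}\.){2,4}')

# strip() drops the leading space that the character class can pick up
matches = [m.strip() for m in pattern.findall(text)]

print(matches)  # ['U.S.', 'c.v.a.b.', 'a. v. p.']
```

This avoids the tuple unpacking step, at the cost of losing access to the last repetition as a separate group.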

Upvotes: 3
