Reputation: 19329
To extract the first three letters 'abc' and the three sets of three-digit numbers from abc_000_111_222,
I am using the following expression:
import re
text = 'abc_000_111_222'
print(re.findall(r'^[a-z]{3}_[0-9]{3}_[0-9]{3}_[0-9]{3}', text))
But the expression returns an empty list when minuses or periods are used instead of underscores: abc.000.111.222
or abc-000-111-222
or any combination of them, like abc_000.111-222
Sure, I could use a simple replace to unify the text variable: text = text.replace('-', '_').replace('.', '_')
But I wonder whether, instead of replacing, I could modify the regex so that it recognizes underscores, minuses and periods.
Upvotes: 0
Views: 141
Reputation: 6185
Why not abandon regexes altogether, and use a clearer and simpler solution?
$ cat /tmp/tmp.py
SEP = '_.,;-=+'
def split_str(text):
    # Try each separator in turn and split on the first one that occurs in text.
    for s in SEP:
        res = text.split(s)
        if len(res) > 1:
            return res
print(split_str('abc_000_111_222'))
print(split_str('abc;000;111;222'))
print(split_str('abc.000.111.222'))
print(split_str('abc-000-111-222'))
Which gives:
$ python3 /tmp/tmp.py
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']
['abc', '000', '111', '222']
$
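Note that the loop above stops at the first separator it finds, so a mixed input such as abc_000.111-222 is only split on the underscore. A minimal sketch of one way to handle mixed separators while still avoiding regexes, assuming Python 3 (the name split_mixed is made up for illustration):

```python
SEP = '_.,;-=+'

def split_mixed(text, seps=SEP):
    # Map every separator character onto the first one, then split once.
    table = str.maketrans({s: seps[0] for s in seps})
    return text.translate(table).split(seps[0])

print(split_mixed('abc_000.111-222'))
print(split_mixed('abc;000;111;222'))
```

Unlike the regex approach, this does not check that the pieces are three letters followed by three-digit numbers; it only splits.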
Upvotes: -1
Reputation: 30995
You can use a regex character class, written with [...]. For your case, it can be [_.-] (note the hyphen at the end: placed between two characters, it would be read as a range, like [a-z]).
You can use a regex like this:
print(re.findall(r'^[a-z]{3}[_.-][0-9]{3}[_.-][0-9]{3}[_.-][0-9]{3}', text))
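A minimal sketch of how the character class behaves on the inputs from the question. The names SEPARATED and extract_parts are made up for illustration, and the capture groups are an assumption that you want the four pieces individually rather than the whole matched string:

```python
import re

# Hypothetical name; the same character-class pattern, with each piece captured.
SEPARATED = re.compile(r'^([a-z]{3})[_.-]([0-9]{3})[_.-]([0-9]{3})[_.-]([0-9]{3})')

def extract_parts(text):
    """Return the letters and the three digit groups, or None if there is no match."""
    m = SEPARATED.match(text)
    return m.groups() if m else None

for sample in ('abc_000_111_222', 'abc.000.111.222',
               'abc-000-111-222', 'abc_000.111-222'):
    print(sample, extract_parts(sample))
```

Any separator outside the class, such as a space, makes extract_parts return None.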
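Btw, you can shorten your regex to something like this (the group is non-capturing, (?:...), because re.findall would otherwise return only the last captured group instead of the whole match):
print(re.findall(r'^[a-z]{3}[_.-](?:\d{3}[_.-]){2}\d{3}', text))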
As a side note, if you want every separator to be the same character, you can capture the first one in a group and refer back to it with \1, like this:
^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}
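A short sketch of the backreference pattern in use. The names SAME_SEP and has_consistent_separator are made up for illustration. It uses re.match rather than re.findall because, with a capture group in the pattern, findall returns the captured separator and not the whole string:

```python
import re

# Hypothetical name; \1 must repeat whatever the first ([_.-]) group matched.
SAME_SEP = re.compile(r'^[a-z]{3}([_.-])[0-9]{3}\1[0-9]{3}\1[0-9]{3}')

def has_consistent_separator(text):
    """True if text matches and uses the same separator throughout."""
    return SAME_SEP.match(text) is not None

print(has_consistent_separator('abc-000-111-222'))  # same separator
print(has_consistent_separator('abc_000.111-222'))  # mixed separators
print(SAME_SEP.findall('abc.000.111.222'))          # only the captured group
```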
Upvotes: 3