Reputation: 17466
I want a regex that matches text containing both alphabetic and numeric characters, but does NOT match text that is only alphabetic or only numeric. E.g., in Python:
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
import re
rr = re.compile('([0-9a-z]{8})')
print 'sub=', rr.sub('########', s)
print 'findall=', rr.findall(s)
generates the following output:
sub= [########: ########]: STARTED at ########ng job number ########
findall= ['mytaskid', '3fee46d2', 'processi', '10022001']
I want it to be:
sub= [mytaskid: ########]: STARTED at processing job number 10022001
findall= ['3fee46d2']
Any ideas?
In this case it's always exactly 8 chars, but it would be even better to have a regex without {8} in it, i.e. one that matches even if there are more or fewer than 8 chars.
-- edit --
The question is more about whether there is a way to write a regex that combines 2 patterns (in this case [0-9] and [a-z]) and ensures the matched string matches both, while the number of chars matched from each set is variable. E.g. s could also be
s = 'mytaskid 3fee46d2 STARTED processing job number 10022001'
-- answer --
Thanks to all for the answers. All of them give me what I want, so everyone gets a +1 and the first one to answer gets the accepted answer, although jerry explains it the best. :)
If anyone is a stickler for performance, there is nothing to choose between them; in the timings below they all come out about the same.
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# ^^^^^^^^ <- I want something that'll only match this part.
def testIt(regEx):
    import re
    from timeit import timeit
    s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
    # Check the regEx under test (not a hard-coded pattern) before timing it.
    assert (re.sub(regEx, '########', s) ==
            '[mytaskid: ########]: STARTED at processing job number 10022001'), '"%s" does not work.' % regEx
    # Caution: the setup code embeds regEx in a plain '...' literal, so the \b
    # escapes are read there as backspace characters, not word boundaries.
    print 'sub() with \'', regEx, '\': ', timeit('rr.sub(\'########\', s)', number=500000, setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    )
    print 'findall() with \'', regEx, '\': ', timeit('rr.findall(s)', setup='''
import re
s = '%s'
rr = re.compile('%s')
''' % (s, regEx)
    )
testIt('\\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\\b')
testIt('\\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\\b')
testIt('\\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\\b')
testIt('\\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\\b')
produced:
sub() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.328042736387
findall() with ' \b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b ': 0.350668751542
sub() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.314759661193
findall() with ' \b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b ': 0.35618526928
sub() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.322802906619
findall() with ' \b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b ': 0.35330467656
sub() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.320779061371
findall() with ' \b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b ': 0.347522144274
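For anyone who wants to repeat the comparison, here is a minimal Python 3 sketch of the same idea. It compiles each pattern once from a raw string and passes callables to timeit, so no pattern text has to be embedded in a setup string. The helper name time_pattern and the iteration count are illustrative choices, not part of the run shown above, and no timings are claimed for it.

```python
import re
from timeit import timeit

s = '[mytaskid: 3333fe46d2]: STARTED at processing job number 10022001'
expected = '[mytaskid: ########]: STARTED at processing job number 10022001'

# The four patterns from the answers, written as raw strings.
patterns = [
    r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b',
    r'\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b',
    r'\b(?=[a-z0-9]*[0-9])[a-z0-9]*[a-z][a-z0-9]*\b',
    r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b',
]

def time_pattern(regex, number=10000):
    """Hypothetical helper: verify one pattern, then time sub() and findall()."""
    rr = re.compile(regex)
    # Make sure the pattern under test really produces the wanted result.
    assert rr.sub('########', s) == expected, '"%s" does not work.' % regex
    t_sub = timeit(lambda: rr.sub('########', s), number=number)
    t_findall = timeit(lambda: rr.findall(s), number=number)
    return t_sub, t_findall

results = {regex: time_pattern(regex) for regex in patterns}
for regex, (t_sub, t_findall) in results.items():
    print('%-55s sub=%.4f findall=%.4f' % (regex, t_sub, t_findall))
```

Using the same number of iterations for both calls keeps the sub() and findall() figures directly comparable.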
Upvotes: 3
Views: 650
Reputation: 11703
Try the following regex:
\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b
This will match a word containing a digit followed by a letter, or vice versa.
Any word that has at least one digit and at least one letter must contain such an adjacent pair somewhere, so the pattern covers the complete set of those words.
Note: Although it is not the case with Python, I have observed that not all tools support lookahead and lookbehind, so I prefer to avoid them where possible.
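The pattern from this answer can be checked against the string from the question with a short Python 3 sketch (the variable names are only illustrative):

```python
import re

# Pattern from this answer: somewhere in the word a letter and a digit
# must sit next to each other, in either order.
pattern = re.compile(r'\b[0-9a-z]*(?:[a-z][0-9]|[0-9][a-z])[0-9a-z]*\b')

s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'

found = pattern.findall(s)           # whole matches, since the group is non-capturing
masked = pattern.sub('########', s)  # replace each mixed word with a mask

print(found)
print(masked)
```

Because the group is written as (?:...), findall returns the full matched words rather than just the letter/digit pair.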
Upvotes: 4
Reputation: 6935
Not the most beautiful regular expression, but it works:
\b[a-z\d]*(?:\d[a-z]|[a-z]\d)[a-z\d]*\b
Upvotes: 2
Reputation: 71548
If the format is the same each time, that is:
[########: ########]: STARTED at ########ng job number ########
You can use:
([^\]\s]+)\]
With re.findall, or with re.search followed by .group(1).
[^\]\s]+ is a negated class and will match any character except whitespace or a closing square bracket.
The regex basically looks for characters (except ] or whitespace) up until a closing square bracket.
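As a small illustration of the bracket-anchored pattern, here is a Python 3 sketch using re.search and .group(1). The second string is the bracket-free variant from the question's edit, included to show that this pattern depends on the closing bracket being present.

```python
import re

bracket_re = re.compile(r'([^\]\s]+)\]')

s1 = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
s2 = 'mytaskid 3fee46d2 STARTED processing job number 10022001'

def task_id(line):
    """Return the token right before the first ']', or None if there is none."""
    match = bracket_re.search(line)
    return match.group(1) if match else None

print(task_id(s1))
print(task_id(s2))
```

Guarding on the match object avoids an AttributeError when a line has no closing bracket.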
If you want to match any string containing both alpha and numeric characters, regardless of the surrounding format, you can use lookaheads:
\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b
Used like so:
result = re.search(r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b', text, re.I)
re.I makes the match case-insensitive.
\b is a word boundary and will match only between a 'word' character and a 'non-word' character (or start/end of string).
(?=[0-9]*[a-z]) is a positive lookahead and makes sure there's at least 1 alpha in the part to be matched.
(?=[a-z]*[0-9]) is a similar lookahead but checks for digits.
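To see what re.I changes here, a minimal Python 3 sketch runs the lookahead pattern with and without the flag. The sample text is a variant of the question's string with an upper-case id, made up for this illustration.

```python
import re

regex = r'\b(?=[0-9]*[a-z])(?=[a-z]*[0-9])[a-z0-9]+\b'

# Made-up sample: the id is upper-case this time.
text = 'mytaskid 3FEE46D2 STARTED processing job number 10022001'

with_flag = re.findall(regex, text, re.I)  # letters match in either case
without_flag = re.findall(regex, text)     # only lower-case letters count

print(with_flag)
print(without_flag)
```

Without the flag the first lookahead cannot find a lower-case letter after the leading digit, so the upper-case id is skipped.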
Upvotes: 1
Reputation: 14873
You can use a lookahead, (?=...).
This one matches all words with at least one character from [123] and at least one from [abc].
>>> re.findall('\\b(?=[abc321]*[321])[abc321]*[abc][abc321]*\\b', ' 123abc 123 abc')
['123abc']
This way you can AND several constraints on the same string.
>>> help(re)
(?=...) Matches if ... matches next, but doesn't consume the string.
Another way is to spell it out: a word with at least one of [abc] and one of [123] must contain a [123][abc] or an [abc][123] pair somewhere, resulting in
>>> re.findall('\\b[abc321]*(?:[abc][123]|[123][abc])[abc321]*\\b', ' 123abc 123 abc')
['123abc']
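The AND idea extends to more than two constraints by stacking one lookahead per character class. The sketch below is only an illustration under that assumption: word_with_all is a hypothetical helper, not something from the re module, and it trusts its arguments to be valid character-class bodies such as 'a-z' or '0-9'.

```python
import re

def word_with_all(*classes):
    """Hypothetical helper: match a word that has a character from every class.

    Each argument is the inside of a character class, e.g. 'a-z' or '0-9'.
    """
    allowed = ''.join(classes)
    # One lookahead per class: scan over allowed characters until a
    # character of that class turns up.
    lookaheads = ''.join('(?=[%s]*[%s])' % (allowed, cls) for cls in classes)
    return re.compile(r'\b' + lookaheads + '[%s]+' % allowed + r'\b')

two = word_with_all('abc', '123')
three = word_with_all('a-z', '0-9', '_')

print(two.findall(' 123abc 123 abc'))
print(three.findall('ab_1 ab1 a_b 1_2 x_9y'))
```

Each lookahead starts at the same position and consumes nothing, which is what lets the constraints be checked independently before the word itself is matched.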
Upvotes: 2
Reputation: 693
You can use a more specific regular expression and skip the findall.
import re
s = '[mytaskid: 3fee46d2]: STARTED at processing job number 10022001'
# Raw string so \s and \w reach the regex engine unchanged.
mo = re.search(r':\s+(\w+)', s)
print(mo.group(1))
Upvotes: 0