Darknight

Reputation: 1152

Python regex - (\w+) gives different output when used in a complex expression

I have a question about Python regex behaviour. Here is my sample test.

The first call gives an output:

>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x7f51c0033210>

The second call gives none:

>>> re.match(r'(\w+):(\d+)', 'a-b:1')
>>>

Why doesn't the second regex return a match object, when the first one does for a plain string match even though the string contains special characters?

As far as I know, \w+ matches [a-z,A-Z,_]. I don't understand why (\w+) returns a match object for the string 'a-b'. How can I check whether a given string doesn't contain any special characters?

Upvotes: 2

Views: 4394

Answers (3)

thefourtheye

Reputation: 239453

The docs for re.match say:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.

The match method returns a match object only if it finds a match at the beginning of the string. (\w+) matches a in a-b.

print(re.match(r'(\w+)', 'a-b').group())

will print

a

In the second case, (\w+):(\d+), the only part of the string that could match is b:1, which is not at the beginning of the string. That's why it returns None.

How can I check whether the given string doesn't contain any special characters?

I would say the second regular expression you used, together with the match function, should be enough. I insist on match because match and search behave differently; see http://docs.python.org/2.7/library/re.html#search-vs-match
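As a rough sketch of that check, assuming "special characters" means anything outside \w and that Python 3.4+ is available for fullmatch (the helper name parse_pair is hypothetical, not something from the question). Note that match only anchors the start of the string, so this sketch uses fullmatch to reject trailing junk as well:

```python
import re

# Word characters, a colon, then digits: the asker's second pattern.
PATTERN = re.compile(r'(\w+):(\d+)')

def parse_pair(s):
    """Return (name, number) if the WHOLE string fits the pattern, else None."""
    m = PATTERN.fullmatch(s)
    return m.groups() if m else None

print(parse_pair('ab:1'))    # ('ab', '1')
print(parse_pair('a-b:1'))   # None, because '-' is not a word character
print(parse_pair('ab:1-x'))  # None, because of the trailing '-x'
```

With plain re.match the last input would still produce a match, since nothing forces the pattern to reach the end of the string.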

Remember, you

Upvotes: 1

poke

Reputation: 387587

Taking a look at the actual match will give you an idea of what happens.

>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)

As you can see, the expression matched a. The character class \w matches only actual word characters (letters, digits and the underscore), not separators like dashes. So you can't actually match a-b using just \w+.

Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) matches the 1. However, that does not happen because of how re.match works. As the documentation hints, it only tries to match "at the beginning of string". So when using re.match there is an implicit ^ at the beginning of the expression that makes it match only from the start. It therefore tries to find a match starting with a, and fails because a is followed by - rather than :.

Instead, you can use re.search, which scans the whole string for a position where the expression matches. There, you will get a result:

>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')

For further information on the search vs. match topic, check this section in the manual.
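To put the two functions side by side, here is a small sketch (Python 3 syntax; the variable names are only for illustration). It also shows that anchoring search with \A behaves like match for this input:

```python
import re

pattern = r'(\w+):(\d+)'
text = 'a-b:1'

# match only tries position 0, where 'a' is followed by '-', not ':'.
from_start = re.match(pattern, text)

# search tries every position and succeeds at 'b:1'.
anywhere = re.search(pattern, text)

# An explicit start anchor makes search fail in the same way match does.
anchored = re.search(r'\A' + pattern, text)

print(from_start)          # None
print(anywhere.groups())   # ('b', '1')
print(anchored)            # None
```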

And finally, if you want to match dashes too, you can use a character class such as [\w-]:

>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')

Upvotes: 6

Jon Clements

Reputation: 142136

The first matches the a: one or more word chars.

The second requires one or more word chars immediately followed by a :, and there is no such sequence at the start of the string, since the a is followed by a -.

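In [a-z,A-Z,_] (your approximation of \w, which is closer to [a-zA-Z0-9_]), a-z means the range a to z and A-Z the range A to Z; the hyphen isn't a literal hyphen in this context, and the commas are matched literally. If you do want a literal hyphen, put it as the first or last character of the character class.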
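A quick way to see this for yourself, as a sketch using re.findall (the variable names are only for illustration):

```python
import re

# Inside a class, a-z and A-Z are ranges and the commas are literal,
# so '-' is skipped but ',' is picked up.
ranges_and_commas = re.findall(r'[a-z,A-Z,_]', 'a-b,c')

# A hyphen placed first or last in the class is a literal hyphen.
hyphen_first = re.findall(r'[-\w]+', 'a-b:1')
hyphen_last = re.findall(r'[\w-]+', 'a-b:1')

print(ranges_and_commas)  # ['a', 'b', ',', 'c']
print(hyphen_first)       # ['a-b', '1']
print(hyphen_last)        # ['a-b', '1']
```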

Upvotes: 2
