Reputation: 61041
I am reading a book and they provide an example of how to match a given string with regular expressions. Here is their example:
b*(abb*)*(a|∊) - Strings of a's and b's with no consecutive a's.
Now I've tried converting it to python like so:
>> p = re.compile(r'b*(abb*)*(a|)') # OR
>> p = re.compile(r'b*(abb*)*(a|\b)')
# BUT it still doesn't work
>>> p.match('aa')
<_sre.SRE_Match object at 0x7fd9ad028c68>
My question is two-fold:
Clarification: For people asking what standard regex is - it is the formal language theory standard: http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory
Upvotes: 5
Views: 2568
Reputation: 61041
Thanks for the answers. I feel each answer had part of the answer. Here is what I was looking for.
? symbol is just a shorthand for (something|ε). Thus (a|ε) can be rewritten as a?. So the example becomes:
b*(abb*)*a?
In python we would write:
p = re.compile(r'^b*(abb*)*a?$')
The reason straight translation of regular regular expression syntax into python (i.e. copy and paste) does not work is because python matches the shortest substring (if the symbols $ or ^ are absent) while the theoretical regular expressions match longest initial substring.
So for example if we had a string:
s = 'aa'
Our textbook regex b*(abb*)*a? would not match it because it has two a's. However if we copy it straight to python:
>> p = re.compile(r'b*(abb*)*a?')
>> bool(p.match(s))
True
This is because our regex matches only the substring 'a' of our string 'aa'.
In order to tell python to do a match on the whole string we have to tell it where the beginning and the end of the string is, with the ^ and $ symbols respectively:
>> p = re.compile(r'^b*(abb*)*a?$')
>> bool(p.match(s))
False
Note that python regex match() matches at the beginning of the string, so it automatically assumes the ^ at the start. However the search() function does not, and thus we keep the ^.
So for example:
>> s = 'aa'
>> p = re.compile(r'b*(abb*)*a?$')
>> bool(p.match(s))
False # Correct
>> bool(p.search(s))
True # Incorrect - search ignored the first 'a'
Upvotes: 5
Reputation: 43244
Actually, the example works just fine ... to a small details. I would write:
>>> p = re.compile('b*(abb*)*a?')
>>> m = p.match('aa')
>>> print m.group(0)
'a'
>>> m = p.match('abbabbabababbabbbbbaaaaa')
>>> print m.group(0)
abbabbabababbabbbbba
Note that the group 0 returns the part of the string matched by the regular expression.
As you can see, the expression matches a succession of a and b without repetition of a. If indeed, you want to check the whole string, you need to changed slightly:
>>> p = re.compile('^b*(abb*)*a?$')
>>> m = p.match('aa')
>>> print m
None
the ^
and $
force recognition of the beginning and end of the string.
At last, you can combine both methods by using the first regular expression, but testing at the end:
>>> len(m.group(0)) == len('aa')
Added: For the second part of the OT, it seems to me there is no discrepancy between the standard regex and the python implementation. Of course, the notation is slightly different, and the python implementation suggest some extensions (as most other packages).
Upvotes: 5
Reputation: 33984
1
Use bool(p.match('aa'))
to check if the regexp matches or not
p = re.compile('b*(abb*)*a?$')
\b
matches border of string; place between \w
and \W
(word characters and non-word characters)
2
Regexp is quite standard in python. Yet every language has some flavour of them, they are not 100% portable. There are minor differences which you're expected to lookup prior to using regexp in any specific language.
Addition
\epsilon
does not have special symbol in python. It is an empty character set.
In your example a|\epsilon
is equivalent to (a|)
or just a?
. After which $
is obligatory to match end of string.
Upvotes: 3
Reputation: 6456
the problem with your expression is that it matches the empty string, meaning that if you do:
>>> p = re.compile('b*(abb*)*(a|)')
>>> p.match('c').group(0)
''
and since re.match attempts to match the start of the string, you have to tell it to match it until the end of the string. just use $
for that
>>> p = re.compile(r'b*(abb*)*(a|)$')
>>> print p.match('c')
None
>>> p.match('ababababab').group(0)
'ababababab'
ps- you may have noted that i used r'pattern' instead of 'pattern' more on that here (first paragraphs)
Upvotes: 1
Reputation: 19145
Your second re should be an appropriate replacement for epsilon, as best as I understand it, though I've never seen epsilon in a regex before.
For what it's worth, your pattern is matching 'a'. That is to say, it is matching:
b
"s (choosing zero)(abb*)
"s (choosing zero)a
" or word ending (choosing an a).As Jonathan Feinberg pointed out, if you want to ensure the whole string matches, you have to anchor the beginning ('^'
) and end ('$'
) of your regex. You should also use a raw string whenever constructing regexes in python: r'my regex'. That will prevent excessive backslash escaping confusion.
Upvotes: 1
Reputation: 45324
You're matching because your regex matches any zero-width segment of any specimen text. You need to anchor your regex. Here's one way of doing it, using a zero-width lookahead assertion:
re.compile(r'^(a(?!a)|b)*$')
Upvotes: 1
Reputation: 12187
I'm not exactly sure how match works in python, but I think you might need to add ^....$ to your RE. RegExp matching usually matches sub-strings, and it finds the largest match, in the case of p.match('aa') that's "a" (probably the first one). ^...$ makes sure that you're matching the ENTIRE string, which is I believe what you want.
Theoretical/standard reg exps assume that you're always matching the whole string, because you're using it to define a language of strings that match, not find a substring in an input string.
Upvotes: 3