Reputation: 48725
I am running my unit test suite with Python 3 on code that was developed under Python 2. All unit tests passed under Python 2 but not under Python 3. It seems there is some change in the implementation of re, and it is a real head scratcher for me. Below is a minimal working example to replicate my problem:
Python 2.7.6 (default, Dec 10 2013, 20:01:46)
>>> import re
>>> a = re.compile('test', re.IGNORECASE)
>>> assert a.flags == re.IGNORECASE
>>> # No output, i.e. assertion passed
>>> a.flags
2
>>> re.IGNORECASE
2
Python 3.3.3 (default, Dec 10 2013, 20:13:18)
>>> import re
>>> a = re.compile('test', re.IGNORECASE)
>>> assert a.flags == re.IGNORECASE
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> a.flags
34
>>> re.IGNORECASE
2
Clearly something is going on that I don't expect! I am assuming that there is some set of default flags that are OR'd together to make flags be 34 in Python 3. What I want to know is what these are, so that I can make my assertion pass by comparing against the proper flags. As a bonus, what is the purpose of this?
Upvotes: 1
Views: 3638
Reputation: 239453
The following are the values of the regex flags in Python 3.x.
import re
print (re.IGNORECASE)
print (re.LOCALE)
print (re.MULTILINE)
print (re.DOTALL)
print (re.UNICODE)
print (re.VERBOSE)
print (re.DEBUG)
print (re.A)
Output
2
4
8
16
32
64
128
256
From the docs: "Strings are immutable sequences of Unicode code points."
So, for str patterns, the re.UNICODE flag (32) is enabled by default. Since you have also enabled re.IGNORECASE (2), that is ORed with re.UNICODE, and 2 | 32 gives you 34.
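To make the original assertion pass on both versions, one option is to test only the bit you care about instead of comparing the whole flags value. The following is a minimal sketch, not the only way to do it; the helper name has_flag is hypothetical and not part of the re module.

```python
import re


def has_flag(pattern, flag):
    """Return True if every bit of `flag` is set on the compiled pattern.

    `has_flag` is an illustrative helper, not part of the `re` module.
    """
    return (pattern.flags & flag) == flag


a = re.compile('test', re.IGNORECASE)

# Passes regardless of any implicit flags the interpreter adds.
assert has_flag(a, re.IGNORECASE)
assert not has_flag(a, re.MULTILINE)

# On Python 3 only: a str pattern also carries the implicit re.UNICODE bit,
# so an exact comparison has to include it.
assert a.flags == re.IGNORECASE | re.UNICODE
```

The bitmask check keeps passing if other implicit flags appear, whereas the exact comparison is tied to Python 3's behavior for str patterns.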
Upvotes: 5
Reputation: 48725
After digging through the re source code, I found the following in "sre_parse.py":
def fix_flags(src, flags):
    # Check and fix flags according to the type of pattern (str or bytes)
    if isinstance(src, str):
        if not flags & SRE_FLAG_ASCII:
            flags |= SRE_FLAG_UNICODE  # <===== LOOK AT THIS LINE!!!!!
        elif flags & SRE_FLAG_UNICODE:
            raise ValueError("ASCII and UNICODE flags are incompatible")
    else:
        if flags & SRE_FLAG_UNICODE:
            raise ValueError("can't use UNICODE flag with a bytes pattern")
    return flags
If the pattern is a str and the "ASCII" flag is not set, the "UNICODE" flag is added for you. Its value is SRE_FLAG_UNICODE == 32, so 2 | 32 == re.IGNORECASE | re.UNICODE == 34.
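The branches of that function can be observed from the outside on Python 3. This is a small sketch assuming a Python 3 interpreter; the variable names are just for illustration, and the exact wording of the error message may differ between versions, so it is printed rather than checked.

```python
import re

# str pattern without re.ASCII: re.UNICODE is added implicitly.
str_flags = re.compile('test', re.IGNORECASE).flags

# str pattern with re.ASCII: re.UNICODE is not added.
ascii_flags = re.compile('test', re.IGNORECASE | re.ASCII).flags

# bytes pattern: the flags are left as given.
bytes_flags = re.compile(b'test', re.IGNORECASE).flags

assert str_flags == re.IGNORECASE | re.UNICODE
assert ascii_flags == re.IGNORECASE | re.ASCII
assert bytes_flags == re.IGNORECASE

# Asking for re.UNICODE on a bytes pattern is rejected with ValueError.
try:
    re.compile(b'test', re.UNICODE)
    bytes_unicode_rejected = False
except ValueError as exc:
    bytes_unicode_rejected = True
    print(exc)
```

So the extra 32 only shows up for str patterns compiled without re.ASCII, which is the case in the question.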
This function does not exist in Python 2.x's implementation.
Upvotes: 1