SethMMorton

Reputation: 48725

Changes in re module between Python 2 and Python 3

I am running my unit test suite with Python 3 on code that was developed under Python 2. All unit tests pass under Python 2 but not under Python 3. It seems there is some change in the implementation of re, and it is a real head-scratcher for me. Below is a minimal working example that reproduces my problem:

Python 2.7.6 (default, Dec 10 2013, 20:01:46) 
>>> import re
>>> a = re.compile('test', re.IGNORECASE)
>>> assert a.flags == re.IGNORECASE
>>> # No output,  i.e. assertion passed
>>> a.flags
2
>>> re.IGNORECASE
2

Python 3.3.3 (default, Dec 10 2013, 20:13:18)
>>> import re
>>> a = re.compile('test', re.IGNORECASE)
>>> assert a.flags == re.IGNORECASE
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> a.flags
34
>>> re.IGNORECASE
2

Clearly something is going on that I don't expect! I assume there is some set of default flags that are ORed together to make flags equal 34 in Python 3. What I want to know is what these are, so that I can make my assertion pass by comparing against the proper flags. As a bonus, what is the purpose of this?

Upvotes: 1

Views: 3638

Answers (3)

thefourtheye

Reputation: 239453

The following are the regex flags in Python 3.x:

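import re
# int() keeps the output numeric on versions where the flags are enum members
print(int(re.IGNORECASE))
print(int(re.LOCALE))
print(int(re.MULTILINE))
print(int(re.DOTALL))
print(int(re.UNICODE))
print(int(re.VERBOSE))
print(int(re.DEBUG))
print(int(re.A))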

Output

2
4
8
16
32
64
128
256

From the docs,

Strings are immutable sequences of Unicode code points.

So the re.UNICODE flag is enabled by default for str patterns. Because you also passed re.IGNORECASE, the two are ORed together: 2 | 32 gives 34.
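
To make the assertion from the question pass, one option is to test for the flag you care about instead of comparing the whole value. The following is only a sketch, assuming a str pattern on Python 3; the names `pattern` and `expected` are illustrative and not from the question.

```python
import re

pattern = re.compile('test', re.IGNORECASE)

# Membership test: passes whenever IGNORECASE is among the flags,
# no matter which defaults the interpreter adds on top.
assert pattern.flags & re.IGNORECASE

# Exact comparison: on Python 3, a str pattern also carries UNICODE,
# so the expected value has to include it.
expected = re.IGNORECASE | re.UNICODE
assert pattern.flags == expected
```

The membership test also holds under Python 2, where the flags value is just 2, while the exact comparison is specific to Python 3 str patterns.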

Upvotes: 5

SethMMorton

Reputation: 48725

After digging through the re source code, I found the following in "sre_parse.py":

def fix_flags(src, flags):
    # Check and fix flags according to the type of pattern (str or bytes)
    if isinstance(src, str):
        if not flags & SRE_FLAG_ASCII:
            flags |= SRE_FLAG_UNICODE # <===== LOOK AT THIS LINE!!!!!
        elif flags & SRE_FLAG_UNICODE:
            raise ValueError("ASCII and UNICODE flags are incompatible")
    else:
        if flags & SRE_FLAG_UNICODE:
            raise ValueError("can't use UNICODE flag with a bytes pattern")
    return flags

If the UNICODE flag is not set on a str pattern (and ASCII was not requested), it is added for you. Its value is SRE_FLAG_UNICODE == 32, so 2 | 32 == re.IGNORECASE | re.UNICODE == 34.

This function does not exist in Python 2.x's implementation.
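
The branches of fix_flags can be observed through the public re API. The snippet below is a rough illustration, assuming Python 3; `compile_error` is a hypothetical helper written for this example, not part of re.

```python
import re

# str pattern, no ASCII flag: UNICODE (32) is added automatically.
str_default = re.compile('test', re.IGNORECASE).flags

# str pattern with ASCII (256): UNICODE is not added.
str_ascii = re.compile('test', re.IGNORECASE | re.ASCII).flags

# bytes pattern: UNICODE is never added.
bytes_default = re.compile(b'test', re.IGNORECASE).flags


def compile_error(pattern, flags):
    """Return the ValueError message raised by re.compile, or None."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return str(exc)
    return None


# Both combinations are rejected by the checks quoted above.
ascii_and_unicode = compile_error('test', re.ASCII | re.UNICODE)
unicode_on_bytes = compile_error(b'test', re.UNICODE)

print(str_default, str_ascii, bytes_default)
print(ascii_and_unicode)
print(unicode_on_bytes)
```

So if you want flags to stay at 2 in Python 3, compiling a bytes pattern is one way, though it also changes what the pattern can match against.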

Upvotes: 1

DSM

Reputation: 353039

It's because in Python 3, strings are Unicode, so it makes sense for the UNICODE flag to be on by default.

Python 3:

>>> import re
>>> a = re.compile("a")
>>> a.flags
32
>>> [k for k in dir(re) if getattr(re, k) == 32]
['U', 'UNICODE']
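
The same lookup idea can be extended to decode a whole flags value into names. This is a small sketch; `FLAG_NAMES` and `flag_names` are hypothetical names introduced here, and the list only covers the flags mentioned in this thread.

```python
import re

# Long names of the flags discussed above, in ascending bit order.
FLAG_NAMES = ['IGNORECASE', 'LOCALE', 'MULTILINE', 'DOTALL',
              'UNICODE', 'VERBOSE', 'DEBUG', 'ASCII']


def flag_names(flags):
    """Return the names of the listed re flags whose bit is set in flags."""
    return [name for name in FLAG_NAMES if flags & getattr(re, name)]


compiled = re.compile('test', re.IGNORECASE)
print(flag_names(compiled.flags))
```

On Python 3 this reports both IGNORECASE and UNICODE for the pattern from the question, which makes the extra 32 easy to spot.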

Upvotes: 3
