Strange behavior with Python 3 re.sub

Question

The following code:

import re
print(re.sub('[^a-zA-Z0-9]', '', ',Inc.', re.IGNORECASE).lower())
print(re.sub('[^a-zA-Z0-9]', '', ', Inc.', re.IGNORECASE).lower())

produces:

inc
inc.

https://repl.it/repls/RightThankfulMaintenance

Why?

paxdiablo · Accepted Answer

From the doco, the re.sub signature is:

re.sub(pattern, repl, string, count=0, flags=0)

So, let's examine your call based on that:

re.sub('[^a-zA-Z0-9]', ''    , ', Inc.', re.IGNORECASE)
#      <  pattern   >  <repl>  <string>  <   count   >

You are passing the flag re.IGNORECASE as the count argument. It has the value 2, as print(int(re.IGNORECASE)) shows, though I suspect that value isn't mandated anywhere.

So it only does up to two substitutions, which in your second example are the comma and the space at the start, leaving the trailing period in place. The same limit applied in your first example; it's just that there were only two characters that matched (the comma and the period) rather than three, so you didn't notice.
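To see the limit at work, here is a small sketch that passes the flag's integer value explicitly as count. The variable names are only for illustration, and it assumes re.IGNORECASE is 2, as it is in CPython:

```python
import re

pattern = '[^a-zA-Z0-9]'

# What the original call effectively did: the flag's integer value
# (2 in CPython) ended up as the replacement limit.
limit = int(re.IGNORECASE)

second = re.sub(pattern, '', ', Inc.', count=limit)  # three candidates, two removed
first = re.sub(pattern, '', ',Inc.', count=limit)    # two candidates, both removed

print(repr(second))  # 'Inc.'
print(repr(first))   # 'Inc'
```

The period survives in the second string only because the two allowed substitutions were already used up on the comma and the space.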

Instead, you should use:

>>> re.sub('[^a-zA-Z0-9]', '', ', Inc.', flags=re.IGNORECASE).lower()
'inc'
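Another way to avoid the mix-up, sketched here with a hypothetical helper name (strip_non_alnum is not from the question), is to compile the pattern first. The flags then go to re.compile, and the compiled pattern's sub method has no flags parameter to confuse with count:

```python
import re

# Flags are attached at compile time, so they cannot be mistaken for count later.
NON_ALNUM = re.compile('[^a-z0-9]', re.IGNORECASE)

def strip_non_alnum(text):
    """Drop every non-alphanumeric character and lowercase the result."""
    return NON_ALNUM.sub('', text).lower()

print(strip_non_alnum(',Inc.'))   # inc
print(strip_non_alnum(', Inc.'))  # inc
```

Note that the original character class already lists both a-z and A-Z, so the flag makes no difference to the result there; it only matters once the class is written in a single case, as in this sketch.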
