Kevin Lee

Reputation: 1191

Strange behavior with Python 3 re.sub

The following code:

import re
print(re.sub('[^a-zA-Z0-9]', '', ',Inc.', re.IGNORECASE).lower())
print(re.sub('[^a-zA-Z0-9]', '', ', Inc.', re.IGNORECASE).lower())

produces:

inc
inc.

https://repl.it/repls/RightThankfulMaintenance

Why?

Upvotes: 2

Views: 90

Answers (1)

paxdiablo

Reputation: 881333

From the doco, the re.sub signature is:

re.sub(pattern, repl, string, count=0, flags=0)

So, let's examine your call based on that:

re.sub('[^a-zA-Z0-9]', ''    , ', Inc.', re.IGNORECASE) # default
#       <  pattern  >  <repl>  <string>  <   count   >    <flags>

You are passing the flag re.IGNORECASE as the count argument, not as flags. Its integer value is 2 (you can check with print(int(re.IGNORECASE))), though I suspect that value isn't mandated anywhere.

So it performs at most two substitutions, which in your second example removes the comma and the space at the start but leaves the trailing period. The same limit applied in your first example; you just didn't notice because only two characters matched there (the comma and the period) rather than three.

Instead, you should use:

>>> re.sub('[^a-zA-Z0-9]', '', ', Inc.', flags=re.IGNORECASE).lower()
'inc'
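
If you want to see the effect of count in isolation, here is a small sketch. It passes the flag's integer value as count by keyword, which is what the original positional call amounts to. The names PATTERN, limited, first_example and unlimited are just for illustration, and the value 2 is what CPython gives for int(re.IGNORECASE), not something I'd rely on.

```python
import re

PATTERN = '[^a-zA-Z0-9]'

# What the original call effectively did: the flag's integer value
# (2 in CPython) ends up as the maximum number of substitutions.
limited = re.sub(PATTERN, '', ', Inc.', count=int(re.IGNORECASE))

# The first example has only two non-alphanumeric characters,
# so a limit of two happens to remove all of them.
first_example = re.sub(PATTERN, '', ',Inc.', count=int(re.IGNORECASE))

# Passing the flag by keyword leaves count at 0, meaning "replace all".
unlimited = re.sub(PATTERN, '', ', Inc.', flags=re.IGNORECASE)

print(repr(limited))        # comma and space removed, period left
print(repr(first_example))  # comma and period both removed
print(repr(unlimited))      # every non-alphanumeric character removed
```

Passing count and flags by keyword every time avoids this mix-up, since the two are adjacent optional parameters and both accept integers.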

Upvotes: 3
