dipankar

Reputation: 45

Python split on multiple delimiters bug?

I was looking at the responses to this earlier-asked question:

Split Strings with Multiple Delimiters?

For my variant of this problem, I wanted to split on everything that wasn't from a specific set of chars. That led me to a solution I liked, until I found this apparent bug. Is this a bug, or some quirk of Python I'm unfamiliar with?

>>> import re
>>> b = "Which_of'these-markers/does,it:choose to;split!on?"
>>> b1 = re.split("[^a-zA-Z0-9_'-/]+", b)
>>> b1
["Which_of'these-markers/does,it", 'choose', 'to', 'split', 'on', '']

I don't understand why it doesn't split on a comma (','), given that the comma is not in my exception list.

Upvotes: 4

Views: 298

Answers (2)

Wiktor Stribiżew

Reputation: 626929

The '-/ inside the character class creates a range (from ' through /) that includes the comma.
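
To see which characters that accidental range covers, one can enumerate it by code point. This is a minimal sketch; the variable names are illustrative and not taken from the question:

```python
# List every character in the range that '-/ denotes inside a character class.
start, end = ord("'"), ord("/")  # code points 39 and 47
range_chars = [chr(cp) for cp in range(start, end + 1)]

print(range_chars)
# ["'", '(', ')', '*', '+', ',', '-', '.', '/']

# The comma is inside the range, so the negated class never treats it
# as a delimiter.
print("," in range_chars)  # True
```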

When you need to put a literal hyphen in a Python re pattern, put it:

  • at the start: [-A-Z] (matches an uppercase ASCII letter and -)
  • at the end: [A-Z()-] (matches an uppercase ASCII letter, (, ) or -)
  • after a valid range: [A-Z-+] (matches an uppercase ASCII letter, - or +)
  • or just escape it: [A-Z\-+].
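
The placements above can be checked quickly. This is a small sketch with a sample string chosen only for illustration; in each pattern the hyphen should be matched as a literal character:

```python
import re

# Each character class below contains a literal hyphen.
patterns = [
    r"[-A-Z]",    # hyphen at the start
    r"[A-Z()-]",  # hyphen at the end
    r"[A-Z-+]",   # hyphen right after a complete range
    r"[A-Z\-+]",  # escaped hyphen
]

sample = "a-B+c(D)"  # hypothetical test input
for pattern in patterns:
    # findall returns every single character that the class matches.
    print(pattern, re.findall(pattern, sample))
```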

You cannot put it after a shorthand class and right before a standalone symbol, as in [\w-+]: that causes a "bad character range" error. This is valid in .NET and some other regex flavors, but not in Python re.
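
One way to observe this is to compile the pattern and catch re.error. The exact message text may differ between Python versions, so this sketch only records and prints it:

```python
import re

try:
    # Shorthand class, then a hyphen, then a standalone symbol.
    re.compile(r"[\w-+]")
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Prints the error text if compilation failed, otherwise None.
print(error_message)
```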

So, in your pattern, put the hyphen at the end of the character class, or escape it.

Use

re.split(r"[^a-zA-Z0-9_'/-]+", b)

In Python 2.7, where \w matches only [a-zA-Z0-9_] by default, you may even contract it to

re.split(r"[^\w'/-]+", b)
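
Applied to the string from the question, both patterns should now split on the comma as well. The sketch below assumes Python 3, where re.ASCII is passed so that \w stays limited to ASCII word characters; the names explicit and shorthand are illustrative only:

```python
import re

b = "Which_of'these-markers/does,it:choose to;split!on?"

# Hyphen moved to the end of the class, so it is a literal.
explicit = re.split(r"[^a-zA-Z0-9_'/-]+", b)

# Shorthand variant; re.ASCII keeps \w ASCII-only on Python 3.
shorthand = re.split(r"[^\w'/-]+", b, flags=re.ASCII)

print(explicit)
# ["Which_of'these-markers/does", 'it', 'choose', 'to', 'split', 'on', '']
print(explicit == shorthand)  # True
```

The trailing empty string appears because the input ends with a delimiter (the ?).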

Upvotes: 7

Rahul

Reputation: 2738

The '-/ is interpreted as a range from ASCII value 39 (') to 47 (/), which includes the comma (,) with ASCII value 44.

You will have to put the - either at the beginning or at the end of the character class.

Upvotes: 2
