calpyte

Reputation: 885

Regular expressions: replace comma in string, Python

I am somewhat puzzled by the way regular expressions work in Python. I want to replace every comma inside a string that is preceded by a letter and followed by either a letter or a whitespace character. For example:

2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15

The first line effectively has 6 columns, while the second line has 7. I am therefore trying to remove the comma between N and L in the second line, turning "N, L" into "N L", like so:

2015,2135,602832/09,DOYLE V ICON LLC,15,15

This is what I have tried so far, without success however:

new_text = re.sub(r'([\w],[\s\w|\w])', "", text) 

Any ideas where I am wrong?

Help would be much appreciated!

Upvotes: 3

Views: 3385

Answers (2)

Quinn

Reputation: 4504

\w matches a-z, A-Z, 0-9 and the underscore, so your regex also matches the commas between the numeric fields. You could try the following regex and replace each match with \1\2.

([a-zA-Z]),(\s|[a-zA-Z])
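As a rough sketch of how this pattern could be applied with re.sub (the variable names are illustrative and not part of the answer): the two groups capture the characters on either side of the comma, and the replacement \1\2 writes them back without the comma.

```python
import re

# Sample line taken from the question.
line = "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"

# Group 1: the letter before the comma.
# Group 2: the letter or whitespace character after it.
pattern = re.compile(r'([a-zA-Z]),(\s|[a-zA-Z])')

# \1\2 puts both captured characters back, dropping only the comma.
fixed = pattern.sub(r'\1\2', line)
print(fixed)
```

The commas between the numeric fields are untouched because neither neighbour is a letter.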


Upvotes: 0

Wiktor Stribiżew

Reputation: 626747

The pattern you use, ([\w],[\s\w|\w]), consumes a word character (an alphanumeric or an underscore, [\w]) before the comma, then matches the comma itself, and then matches (and again consumes) one more character: a whitespace, a word character, or a literal |. Inside a character class, the pipe is a literal pipe symbol, not the alternation operator.

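So, the main problem is that \w matches both letters and digits. In addition, because the characters around the comma are consumed, replacing the match with an empty string deletes them together with the comma.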
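The behaviour described above can be checked with a small sketch. The variable names are made up for illustration, and the sample line is the second one from the question.

```python
import re

# Inside [...] the pipe is an ordinary character, so this class
# matches a literal "|" in addition to whitespace and word characters.
char_class = re.compile(r'[\s\w|\w]')
pipe_is_literal = char_class.fullmatch('|') is not None
print(pipe_is_literal)

# The original attempt removes each matched comma together with its two
# neighbours, and it also fires between digits because \w is not
# limited to letters.
sample = "2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
mangled = re.sub(r'([\w],[\s\w|\w])', "", sample)
print(mangled)
```

Printing mangled shows that no commas survive and that characters from the numeric fields are lost as well.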

You can actually leverage lookarounds:

(?<=[a-zA-Z]),(?=[a-zA-Z\s])


The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.

Here is a Python demo:

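import re

# Match a comma only when a letter precedes it and a letter or whitespace follows it.
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
# The lookarounds consume nothing, so only the comma itself is removed.
result = p.sub("", test_str)
print(result)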
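One practical difference between lookarounds and a capture-group pattern such as ([a-zA-Z]),(\s|[a-zA-Z]) shows up when two qualifying commas share a neighbouring letter. The following is a minimal sketch with a made-up input string:

```python
import re

# Hypothetical input: both commas qualify, and they share the letter "B".
text = "A,B,C"

# The capture groups consume "A,B", so the scan resumes at the second
# comma and the "B" in front of it is no longer available to match.
with_groups = re.sub(r'([a-zA-Z]),(\s|[a-zA-Z])', r'\1\2', text)

# Lookarounds only inspect the neighbours without consuming them,
# so both commas are matched.
with_lookarounds = re.sub(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])', '', text)

print(with_groups)
print(with_lookarounds)
```

The capture-group version leaves the second comma in place, while the lookaround version removes both.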

If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:

(?<=[^\W\d_]),(?=[^\W\d_]|\s)
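A possible reason to prefer this variant, sketched below under the assumption of Python 3 (where \w is Unicode-aware for str patterns by default): [^\W\d_] also accepts letters outside the ASCII range, which [a-zA-Z] does not. The row is invented for illustration.

```python
import re

ascii_letters = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
any_letters = re.compile(r'(?<=[^\W\d_]),(?=[^\W\d_]|\s)')

# Hypothetical row whose name field contains a non-ASCII letter
# (\u00c9 is "E" with an acute accent) right before the comma.
row = "2015,1,2/09,CAF\u00c9, SARL,15,15"

ascii_result = ascii_letters.sub("", row)    # comma after the accented letter stays
unicode_result = any_letters.sub("", row)    # comma after the accented letter is removed

print(ascii_result)
print(unicode_result)
```

Digits and underscores are still excluded, so the commas between the numeric fields stay in place in both results.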


Upvotes: 7
