Reputation: 997
I am trying to understand how the re.split() function works with a non-capturing group when splitting a comma-delimited string.
This is my code:
import re

pattern = re.compile(r',(?=(?:"[^"]*")*[^"]*$)')
text = 'qarcac,"this is, test1",123566'
results = re.split(pattern, text)
for r in results:
    print(r.strip())
When I execute this code, the results are as expected.
split1: qarcac
split2: "this is, test1"
split3: 123566
However, if I add one more double-quoted string to the source text, it doesn't work as expected:
text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'
It produces the output below:
split1: qarcac,"this is, test1"
split2: "this is, test2"
split3: 123566
Can someone explain what's going on here and how the non-capturing group works differently in these two cases?
Upvotes: 1
Views: 2302
Reputation: 85757
This has nothing to do with (non-)capturing groups.
(?:"[^"]*")*[^"]*$
matches:
"[^"]*"
- a quoted string (two quotes with 0 or more non-quotes in between)
(?: ... )*
- 0 or more of those quoted strings
[^"]*
- followed by 0 or more non-quotes
$
- followed by the end of the string
In other words, this regex matches something like "foo""bar""baz"otherstuff.
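As a quick check of this reading, here is a minimal sketch that runs the lookahead body on its own. The names tail, adjacent and separated are mine, not from the question, and re.fullmatch stands in for "match from here to the end of the string":

```python
import re

# The body of the lookahead from the question, without the leading comma.
tail = re.compile(r'(?:"[^"]*")*[^"]*$')

# Adjacent quoted strings followed by an unquoted tail: this matches.
adjacent = tail.fullmatch('"foo""bar""baz"otherstuff') is not None

# Quoted strings separated by a comma: the group stops after "foo",
# and [^"]* cannot get past the next quote, so the match fails.
separated = tail.fullmatch('"foo","bar"') is not None

print(adjacent, separated)
```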
In your first example, the target string is:
qarcac,"this is, test1",123566
       ^^^^^^^^^^^^^^^^^^^^^^^
I've underlined the part that is matched by the above regex (a quoted part followed by an unquoted tail followed by the end of the string).
In your second example, the target string is:
qarcac,"this is, test1","this is, test2", 123566, testdata
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Again, I've underlined the part that is matched by the regex.
The first quoted part is not matched because of the comma:
"this is, test1","this is, test2"
                X
"foo","bar" is not matched because your regex requires the quoted parts to be right next to each other, as in "foo""bar", with nothing in between.
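One way to see which commas the original pattern accepts is to test the lookahead body at each comma position. This is only an illustrative sketch; the names lookahead and comma_status are assumptions of mine:

```python
import re

# Lookahead body from the question's pattern.
lookahead = re.compile(r'(?:"[^"]*")*[^"]*$')
text = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

# For each comma, check whether the rest of the string (starting just
# after the comma) satisfies the lookahead.
comma_status = [
    (i, lookahead.match(text, i + 1) is not None)
    for i, ch in enumerate(text)
    if ch == ','
]

for index, ok in comma_status:
    print(index, ok)
```

The first comma is rejected because the text after it contains two quoted parts separated by a comma, which is exactly the "foo","bar" shape described above.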
If you just want to make sure that every matched comma is outside of a quoted part (i.e. is followed by an even number of quotes), you can simply use
,(?=[^"]*(?:"[^"]*"[^"]*)*$)
as your regex.
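A short sketch of that pattern applied to both strings from the question (the variable names fixed, parts1 and parts2 are mine). It assumes quotes are always balanced and never escaped inside a field:

```python
import re

# Split on a comma only if an even number of quotes follows it.
fixed = re.compile(r',(?=[^"]*(?:"[^"]*"[^"]*)*$)')

text1 = 'qarcac,"this is, test1",123566'
text2 = 'qarcac,"this is, test1","this is, test2", 123566, testdata'

parts1 = [p.strip() for p in fixed.split(text1)]
parts2 = [p.strip() for p in fixed.split(text2)]

print(parts1)
print(parts2)
```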
Upvotes: 1