Reputation: 248
I have this code to split a complicated CSV file into chunks. The hard part is that commas may also appear between "" pairs, and those commas must not be split on. The regex I am using to find commas outside the "" pairs works fine:
comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')
import re
test = 'Test1,Test2,"",Test3,Test4"",Test5'
comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')
print(comma_re.split(test))
Output:
['Test1', 'Test2,"",Test3,Test4""', 'Test2', '"",Test3,Test4""', '"",Test3,Test4""', None, 'Test5']
Desired:
['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
How can I avoid the useless split results?
Edit: I didn't even know there was a built-in csv module, so I switched to using that. Thanks for your efforts!
Upvotes: 0
Views: 78
Reputation: 1640
(?<!"),(?![^",]+")|,(?=[^"]*$)
This will work for the example you gave, but it won't work if the input deviates from that format.
import re
input = 'Test1,Test2,"",Test3,Test4"",Test5'
output = re.split(r'(?<!"),(?![^",]+")|,(?=[^"]*$)', input)
print(output)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
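As for why your original pattern produces the extra entries: `re.split` also returns the text of every capturing group in the pattern, and `None` when a group did not take part in the match. Your lookahead contains a capturing group `( ... )`, so its contents get inserted after each split. A minimal sketch of the fix, keeping your pattern as it is and only switching the group to a non-capturing `(?: ... )` (the variable name `parts` is mine):

```python
import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'

# Same pattern as in the question, but with (?:...) instead of (...),
# so re.split() has no captured group to insert into the result.
comma_re = re.compile(r',(?=(?:[^"]*""[^"]*"")*[^"]*$)')

parts = comma_re.split(test)
print(parts)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
```

This only removes the unwanted entries. It still relies on the "" pairs being balanced, like the original pattern does.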
You should really be using a CSV parser for this. If you can't for some reason, do the string processing manually: go through the input character by character and split whenever you see a comma, unless you have recognised that you are inside a quoted string. Something like the following:
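input = 'Test1,Test2,"",Test3,Test4"",Test5'
insideQuoted = False
output = []
lastIndex = 0
for i in range(len(input)):
    if input[i] == ',' and not insideQuoted:
        # Comma outside a quoted section: close the current field.
        output.append(input[lastIndex:i])
        lastIndex = i + 1
    elif input[i] == '"' and i < len(input) - 1 and input[i + 1] == '"':
        # First quote of a "" pair: toggle the quoted state.
        insideQuoted ^= True
    elif i == len(input) - 1:
        # End of input: append the final field.
        output.append(input[lastIndex:i + 1])
print(output)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']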
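For the CSV parser route, here is a rough sketch using Python's built-in `csv` module. Note the assumption: the sample line below is hypothetical and uses standard CSV quoting (a single `"` on each side of the field), not the doubled `""` from your example, which the module may parse differently. The names `line` and `fields` are just for illustration.

```python
import csv
import io

# Hypothetical sample line using standard CSV quoting (one " on each side).
line = 'Test1,Test2,"Test3,Test4",Test5'

# csv.reader accepts any iterable of lines; StringIO stands in for a file.
reader = csv.reader(io.StringIO(line))
fields = next(reader)
print(fields)
# ['Test1', 'Test2', 'Test3,Test4', 'Test5']
```

With a real file you would pass the open file object to `csv.reader` instead (opened with `newline=''`, as the module's documentation recommends) and loop over the rows.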
Upvotes: 1