Reputation: 6132
I'm working with a .csv file and, as always, it has format problems. In this case it's a ;-separated table, but there's a column that sometimes contains semicolons, like this:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2
So there are three cases:
I turned the .csv into a .txt, imported it as a string, and then compiled this regex:
re.compile('([^\d\W]);\s+([^\d\W])', re.S)
That should do the job. I almost managed to replace those semicolons with commas by doing the following:
import re

def replace_comma(match):
    # Swap the semicolon inside the matched text for a comma.
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
# `string` holds the contents of the .txt file.
string2 = string.split('\n')
for n, i in enumerate(string2):
    if len(re.findall(r'([^\d\W]);(\s+)([^\d\W])', i)) >= 1:
        string2[n] = regex.sub(replace_comma, i)
This mostly works, but when there are two whitespace characters after the semicolon, it leaves an \xa0 after the comma. I have two problems with this approach:
\xa0 character? Do you know any better way to approach this?
Thanks
Edit: My desired output would be:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
Edit: Added explanation about turning the file into a string for better manipulation.
Upvotes: 3
Views: 2606
Reputation: 195418
For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:
data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2'''
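for line in data.splitlines():
    # Peel off the first column, then the last two columns from the right.
    row = line.split(';', maxsplit=1)
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    # Whatever is left in the middle is the summary text.
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))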
Prints:
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
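If the \xa0 (non-breaking space) from the question is also a concern, the same split/rsplit idea can be wrapped in a small helper. This is only a sketch: the name fix_line and its n_leading / n_trailing parameters are made up for illustration, and it assumes exactly one free-text column sits between a fixed number of leading and trailing columns. Unlike the plain replace() above, it uses re.sub for the last step, because in Python 3 the \s class on str patterns also matches \xa0, so the stray whitespace after the semicolon is collapsed into a single space.

```python
import re


def fix_line(line, n_leading=1, n_trailing=2):
    """Turn stray ';' inside the free-text column into ', ' (hypothetical helper)."""
    # Split off the leading columns; the last element keeps everything else.
    head = line.split(';', maxsplit=n_leading)
    # Split off the trailing columns from the right; tail[0] is the free text.
    tail = head[-1].rsplit(';', maxsplit=n_trailing)
    # Replace each ';' plus any following whitespace (including \xa0) with ', '.
    tail[0] = re.sub(r';\s*', ', ', tail[0])
    return ';'.join(head[:-1] + tail)


# Sample data from the question, with an \xa0 added to the last row.
data = '''code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;\xa0 animals;2;2'''

fixed = [fix_line(line) for line in data.splitlines()]
print('\n'.join(fixed))
```

The header row passes through unchanged because it has no extra semicolons, so the whole file can be fed through the helper line by line.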
Upvotes: 2