Juan C

Reputation: 6132

Replacing semicolons with commas in a CSV using regex in Python

I'm working with a .csv file and, as always, it has format problems. In this case it's a semicolon-separated table, but one column (summary) sometimes contains semicolons itself, like this:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2

So there are three cases:

I saved the .csv as a .txt, read it in as a single string, and compiled this regex:

re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

That should do it. I almost managed to replace those semicolons with commas by doing the following:

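import re

def replace_comma(match):
    # Swap the semicolon inside the matched text for a comma.
    text = match.group()
    return text.replace(';', ',')

# Raw string so \d, \W and \s reach the regex engine unchanged.
regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

# `string` holds the whole file contents, read in earlier.
string2 = string.split('\n')

for n, i in enumerate(string2):
    if regex.search(i):
        string2[n] = regex.sub(replace_comma, i)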

This mostly works, but when there are two whitespace characters after the semicolon, it leaves a \xa0 after the comma. I have two problems with this approach:

Do you know any better way to approach this?

Thanks

Edit: My desired output would be:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2

Edit: Added explanation about turning the file into a string for better manipulation.

Upvotes: 3

Views: 2606

Answers (1)

Andrej Kesely

Reputation: 195418

For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:

data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2'''

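for line in data.splitlines():
    # Peel off the first field (code) from the left.
    row = line.split(';', maxsplit=1)
    # Peel off the last two fields (sector, sub_sector) from the right.
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    # Whatever is left in the middle is the summary; its semicolons become commas.
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))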

Prints:

1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2
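The same idea can be wrapped in a small helper so it also passes the header line through untouched. This is only a sketch: the names `fix_summary`, `fix_text`, `leading` and `trailing` are made up for illustration, and it assumes exactly one free-text column with a known number of clean columns on each side. It also assumes the \xa0 mentioned in the question is a non-breaking space already present in the file (in Python 3, \s matches it in a str pattern), so it is converted to a regular space before splitting.

```python
def fix_summary(line, sep=";", leading=1, trailing=2):
    """Replace separators inside the single free-text field with commas.

    Assumes `leading` clean fields before the free-text field and
    `trailing` clean fields after it (hypothetical parameter names).
    """
    parts = line.split(sep)
    if len(parts) <= leading + trailing + 1:
        return line  # no stray separators, nothing to merge
    end = len(parts) - trailing
    head, middle, tail = parts[:leading], parts[leading:end], parts[end:]
    return sep.join(head + [",".join(middle)] + tail)


def fix_text(text):
    # Normalise non-breaking spaces (\xa0) to ordinary spaces first,
    # then repair each line independently.
    cleaned = text.replace("\xa0", " ")
    return "\n".join(fix_summary(line) for line in cleaned.splitlines())


data = (
    "code;summary;sector;sub_sector\n"
    "1;fishes;2;2\n"
    "2;agriculture; also fishes;1;2\n"
    "3;fishing. Extraction; \xa0animals;2;2"
)
print(fix_text(data))
```

Because it counts fields rather than matching characters, this does not depend on what follows the semicolon, but it will give wrong results if more than one column can contain stray separators.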

Upvotes: 2
