Srpic

Reputation: 450

CSV - how to fix encoding issues

I have a CSV file that Notepad++ reports as UTF-8, but the text is obviously garbled. Decoding it with other common encodings does not help; it is still unreadable.

Is there a way to fix the encoding issues? I have also tried detecting the encoding with Python, but it reports UTF-8 as well:

import chardet

# read the first 100,000 bytes and let chardet guess the encoding
with open("data.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))

# check what the character encoding might be
print(result)

Example:

Location
Switzerland Â» Lake Geneva Â» VÃ©senaz
Germany Â» BÃ¶nningstedt
Switzerland Â» Lake of Zurich Â» StÃ¤fa ZH
Denmark Â» Svendborg
Germany Â» Bayern Â» MÃ¼nchen
Switzerland Â» Lake Constance Â» Uttwil
Switzerland Â» Neuenburgersee Â» Yvonand 
Denmark Â» Svendborg
Germany Â» Bayern Â» Boote+service Oberbayern
Italy Â» Dormelletto 
Switzerland Â» Seengen
Switzerland Â» Lake of Zurich Â» StÃ¤fa am ZÃ¼richsee
Italy Â» Lake Garda Â» Moniga del Garda (BS)
Switzerland Â» Zugersee Â» Neuheim
Switzerland Â» VierwaldstÃ¤ttersee Â» 6004
Switzerland Â» Safenwil
Switzerland Â» Lake Constance Â» Uttwil
Denmark Â» Svendborg
"France Â» MARTGUES, MARTIGUES"
Germany Â» Bayern Â» Forchheim/Ofr.
Germany Â» Bayern Â» MÃ¼nchen
Switzerland Â» Luganersee Â» Caslano
Germany Â» Nordrhein-Westfalen Â» WSC Hopp / MÃ¶nchengladbach
"Germany Â» BOOTSSERVICE ENK IN TREIS KARDEN, BOOTSSERVICE ENK"

Upvotes: 0

Views: 178

Answers (1)

Serge Ballesta

Reputation: 149185

The file is correctly encoded as UTF-8. The problem comes later: its UTF-8 bytes are being displayed by something that expects Latin1 or cp1252 encoding.

Here is the evidence:

# t is the garbled text: UTF-8 bytes that were wrongly decoded as Latin1
t = '''Location
Switzerland Â» Lake Geneva Â» VÃ©senaz
Germany Â» BÃ¶nningstedt
Switzerland Â» Lake of Zurich Â» StÃ¤fa ZH
Denmark Â» Svendborg
Germany Â» Bayern Â» MÃ¼nchen
Switzerland Â» Lake Constance Â» Uttwil
Switzerland Â» Neuenburgersee Â» Yvonand 
Denmark Â» Svendborg
Germany Â» Bayern Â» Boote+service Oberbayern
Italy Â» Dormelletto 
Switzerland Â» Seengen
Switzerland Â» Lake of Zurich Â» StÃ¤fa am ZÃ¼richsee
Italy Â» Lake Garda Â» Moniga del Garda (BS)
Switzerland Â» Zugersee Â» Neuheim
Switzerland Â» VierwaldstÃ¤ttersee Â» 6004
Switzerland Â» Safenwil
Switzerland Â» Lake Constance Â» Uttwil
Denmark Â» Svendborg
"France Â» MARTGUES, MARTIGUES"
Germany Â» Bayern Â» Forchheim/Ofr.
Germany Â» Bayern Â» MÃ¼nchen
Switzerland Â» Luganersee Â» Caslano
Germany Â» Nordrhein-Westfalen Â» WSC Hopp / MÃ¶nchengladbach
"Germany Â» BOOTSSERVICE ENK IN TREIS KARDEN, BOOTSSERVICE ENK"
'''
# recover the original bytes, then decode them as what they really are
print(t.encode('latin1').decode('utf-8'))

On my Unicode-enabled system this prints, as expected:

Location
Switzerland » Lake Geneva » Vésenaz
Germany » Bönningstedt
Switzerland » Lake of Zurich » Stäfa ZH
Denmark » Svendborg
Germany » Bayern » München
Switzerland » Lake Constance » Uttwil
Switzerland » Neuenburgersee » Yvonand 
Denmark » Svendborg
Germany » Bayern » Boote+service Oberbayern
Italy » Dormelletto 
Switzerland » Seengen
Switzerland » Lake of Zurich » Stäfa am Zürichsee
Italy » Lake Garda » Moniga del Garda (BS)
Switzerland » Zugersee » Neuheim
Switzerland » Vierwaldstättersee » 6004
Switzerland » Safenwil
Switzerland » Lake Constance » Uttwil
Denmark » Svendborg
"France » MARTGUES, MARTIGUES"
Germany » Bayern » Forchheim/Ofr.
Germany » Bayern » München
Switzerland » Luganersee » Caslano
Germany » Nordrhein-Westfalen » WSC Hopp / Mönchengladbach
"Germany » BOOTSSERVICE ENK IN TREIS KARDEN, BOOTSSERVICE ENK"
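
To see where the garbling comes from, the same round trip can be run in the other direction. This is only a minimal sketch: the helper names `garble` and `repair` are made up for illustration, and it assumes the wrong decoder is Latin1 (cp1252 gives the same result for the characters in this file).

```python
def garble(text: str) -> str:
    """Encode correct text as UTF-8, then wrongly decode the bytes as Latin1."""
    return text.encode("utf-8").decode("latin1")


def repair(text: str) -> str:
    """Undo garble(): recover the original bytes, then decode them as UTF-8."""
    return text.encode("latin1").decode("utf-8")


original = "Germany » Bayern » München"
broken = garble(original)
print(broken)          # every non-ASCII character has become two characters
print(repair(broken))  # the original text again
assert repair(broken) == original
```

Each non-ASCII character takes two bytes in UTF-8, and a single-byte decoder shows each byte as its own character, which is why every accented letter turns into a pair.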

Said differently, the file is correct and it is correctly read by read_csv. Only the step after that, where the data is displayed, is wrong. You are not using Excel, are you?
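
If Excel does turn out to be the viewer, one common workaround is to rewrite the file with a byte order mark, which Excel generally uses to recognize UTF-8. A sketch using only the standard library; the function name and the output path `data_excel.csv` are assumptions for illustration, and `data.csv` is the file name from the question.

```python
import csv


def add_bom_for_excel(src: str, dst: str) -> None:
    """Copy a UTF-8 CSV to dst, prefixing a BOM via the 'utf-8-sig' codec."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8-sig") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)


# Example call (both paths are assumptions):
# add_bom_for_excel("data.csv", "data_excel.csv")
```

The data itself is unchanged; only the three BOM bytes are added at the start of the file so that the viewer does not have to guess the encoding.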

Upvotes: 2
