user3064538

Ignore UTF-8 decoding errors in CSV

I have a CSV file (which I have no control over). It's the result of concatenating multiple CSV files. Most of the file is UTF-8, but one of the files that went into it has fields encoded in what looks like Windows-1251.

I actually only care about one of the fields, which contains a URL (so it's valid ASCII/UTF-8).

How do I ignore decoding errors in the other CSV fields if I only care about one field that I know is ASCII? Alternatively, for a more useful solution: how do I change the encoding of each line of a CSV file when there's an encoding error?

Upvotes: 1

Views: 2826

Answers (2)

Harvie.CZ

Reputation: 93

I've found an even shorter solution: just open the file with errors='ignore' (and possibly force the correct encoding as well), effectively discarding any characters that cannot be decoded:

import csv
import sys

csv_file = open(sys.argv[2], encoding='utf-8', errors='ignore')
csv_reader = csv.reader(csv_file, delimiter=',')

Upvotes: 0

user3064538

csv.reader and csv.DictReader accept any iterable of strings (such as a list of lines) as input, not just file objects.
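For example (a toy list of lines standing in for a file):

import csv

# DictReader treats the first string as the header row
lines = ["name,url", "example,https://example.com"]
for row in csv.DictReader(lines):
    print(row["url"])  # https://example.com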

So: open the file in binary mode (mode="rb"), figure out the encoding of each line, decode the line with that encoding, append it to a list, and then call csv.reader (or csv.DictReader) on that list.

One simple heuristic is to try to decode each line as UTF-8 and, if you get a UnicodeDecodeError, to decode it as the other encoding instead. We can make this more general by using the chardet library (install it with pip install chardet) to guess the encoding of each line that can't be decoded as UTF-8, instead of hardcoding which encoding to fall back on:

import csv
import chardet

my_csv = "some/path/to/your_file.csv"

lines = []
with open(my_csv, "rb") as f:
    for line in f:
        try:
            # Most of the file is UTF-8, so try that first
            line = line.decode("utf-8")
        except UnicodeDecodeError:
            # Only run chardet on lines that aren't valid UTF-8
            detected_encoding = chardet.detect(line)["encoding"]
            line = line.decode(detected_encoding)
        lines.append(line)

reader = csv.DictReader(lines)
for row in reader:
    do_stuff(row)

If you do want to hardcode the fallback encoding and skip chardet (there's a good reason not to use it: it's not always accurate), just replace the variable detected_encoding in the code above with "windows-1251" or whatever encoding you need.
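For instance, a minimal sketch of that hardcoded-fallback variant (assuming, as the question suggests, that the stray lines are Windows-1251):

lines = []
with open(my_csv, "rb") as f:
    for raw in f:
        try:
            lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Hardcoded fallback instead of chardet's per-line guess
            lines.append(raw.decode("windows-1251"))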

This is of course not perfect: the fact that a line successfully decodes with some encoding doesn't mean that's actually the encoding it was written in. If you don't have to do this more than a few times, it's better to print out each line along with its detected encoding and figure out by hand where one encoding ends and the other begins. Ultimately, the right strategy might be to reverse the step that led to the broken input (the concatenation of the files) and redo it correctly (by normalizing all the files to the same encoding before concatenating).
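A quick sketch of that manual inspection, printing chardet's guess next to a prefix of each raw line:

import chardet

with open(my_csv, "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        guess = chardet.detect(raw)["encoding"]
        print(lineno, guess, raw[:60])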

In my case, I counted how many lines were detected as which encoding:

import chardet
from collections import Counter

my_csv_file = "some_file.csv"
with open(my_csv_file, "rb") as f:
    encodings = Counter(chardet.detect(line)["encoding"] for line in f)

print(encodings)

and realized that my whole file was actually encoded in some other, third encoding. Running chardet on the whole file detected the wrong encoding, but running it on each line detected a bunch of encodings, and the second most common one (after ascii) was the encoding I needed to read the whole file. So ultimately all I needed was:

with open(my_csv_file, encoding="latin_1") as f:
    reader = csv.DictReader(f)
    for row in reader:
        do_stuff(row)

You could try using the Compact Encoding Detection library instead of chardet. It's what Google Chrome uses, so it may work better, but it's written in C++ rather than Python.

Upvotes: 1
