Joshua Yonathan

Reputation: 487

CSV reader picks up garbage in the first few characters

I am trying to read the first line of a CSV file and assign it to header. The CSV file looks like this:

TIME,DAY,MONTH,YEAR
"3:21","23","FEB","2018"
"3:23","23","FEB","2018"
...

Here is the code:

import csv

with open("20180223.csv") as csvfile:
    rdr = csv.reader(csvfile)
    header = next(rdr)
    print(header)

I expect the output to look like:

['TIME', 'DAY', 'MONTH', 'YEAR']

However, the output looks like this:

['\ufeffTIME', 'DAY', 'MONTH', 'YEAR']

What did I miss?

Upvotes: 18

Views: 8051

Answers (2)

Heitor

Reputation: 692

In PHP, you can strip the byte order mark like this, provided you know for sure the file starts with one:

$ss = substr(file_get_contents('/path/to/file.csv'), 3);
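Note that the substr call drops the first three bytes unconditionally, so it would remove real data from a file that has no BOM. A guarded version of the same idea might look like the sketch below, written in Python to match the question. The function name strip_utf8_bom is made up for illustration; codecs.BOM_UTF8 is the standard-library constant for the three BOM bytes.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    # codecs.BOM_UTF8 is b'\xef\xbb\xbf', the UTF-8 encoding of U+FEFF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # No BOM found: return the bytes unchanged.
    return data
```

This works on raw bytes, so the file would need to be opened in binary mode ("rb") and decoded afterwards.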

Upvotes: 0

sjw

Reputation: 6543

That first character is the byte order mark (BOM).

Try this:

with open("20180223.csv", encoding="utf-8-sig") as csvfile:

This advice is somewhat hidden away in the documentation, but it is there:

In some areas, it is also convention to use a “BOM” at the start of UTF-8 encoded files; the name is misleading since UTF-8 is not byte-order dependent. The mark simply announces that the file is encoded in UTF-8. Use the ‘utf-8-sig’ codec to automatically skip the mark if present for reading such files.
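To see the difference concretely, here is a minimal, self-contained sketch. It assumes the file was saved as UTF-8 with a BOM; the temporary sample file and the helper name read_header are made up for the illustration and are not part of the original question.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first row of the CSV file at path, decoded with encoding."""
    # newline="" is what the csv module documentation recommends for csv.reader.
    with open(path, encoding=encoding, newline="") as csvfile:
        return next(csv.reader(csvfile))


# Build a small sample file that starts with the UTF-8 BOM bytes EF BB BF.
sample = b'\xef\xbb\xbfTIME,DAY,MONTH,YEAR\r\n"3:21","23","FEB","2018"\r\n'
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

with_bom = read_header(sample_path, "utf-8")         # BOM stays in the first field
without_bom = read_header(sample_path, "utf-8-sig")  # BOM is skipped
os.remove(sample_path)

print(with_bom)     # ['\ufeffTIME', 'DAY', 'MONTH', 'YEAR']
print(without_bom)  # ['TIME', 'DAY', 'MONTH', 'YEAR']
```

Since 'utf-8-sig' only skips the mark if it is present, the same open call should also be safe for UTF-8 files that have no BOM.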

Upvotes: 38
