Reputation: 5952
I need to parse a CSV file that contains the degree symbol (°) inside a header. If I try to open the file:
CSV.foreach('myfile.csv', headers: true) do |row|
...
end
I get invalid byte sequence in UTF-8 (ArgumentError). So I tried a few other encodings (ISO-8859-1 and ASCII-8BIT), but I always get a CSV::MalformedCSVError.
Which encoding should I specify in order to be able to read the file?
I don't actually care about the degree symbol, so a solution that simply ignores it (returning, for instance, 'Tx1 C' instead of 'Tx1 °C') would also work for me.
Upvotes: 1
Views: 1871
Reputation: 207560
You could shell out a process to remove the little devils before you open it:
system("LANG=C tr -d '\260' < myfile.csv > $$.tmp && mv $$.tmp myfile.csv")
The tr -d says to delete the character with octal code 260 (0xB0, the degree symbol in ISO-8859-1), saving the results to a file named with the process id ($$) and the extension .tmp. If that was successful (&&), it replaces the original file.
You can try the tr command on its own at the shell to test it like this:
LANG=C tr -d '\260' < myfile.csv
If you target Windows, the tr command will not work and you may have to do something like this to remove the first line:
more +1 unhappy.csv > happy.csv
Note that more has a limit of 65535 lines though.
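If shelling out is not an option (on Windows, for example), the same byte removal can be sketched in plain Ruby. This is only a sketch: strip_degree_bytes is a hypothetical helper name, and it assumes the file uses a single-byte encoding such as ISO-8859-1, where 0xB0 is the degree symbol. In a multi-byte encoding the same byte could be part of another character.

```ruby
# Hypothetical helper: remove every 0xB0 byte (octal 260) from a file in place.
# Assumption: the file is in a single-byte encoding such as ISO-8859-1,
# where 0xB0 is the degree symbol.
def strip_degree_bytes(path)
  data = File.binread(path)          # raw bytes, tagged ASCII-8BIT
  cleaned = data.delete("\xB0".b)    # drop each 0xB0 byte, keep everything else
  File.binwrite(path, cleaned)       # overwrite the original file
end
```

Calling strip_degree_bytes('myfile.csv') before CSV.foreach should then leave a header such as 'Tx1 C'.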
Upvotes: 0
Reputation: 14082
The default encoding for parsing external files is Encoding.default_external, which is UTF-8 in your case (as the error message shows). However, the CSV file isn't stored in UTF-8. When Ruby tries to parse a non-UTF-8 byte sequence as UTF-8, an error arises if the two encodings aren't compatible.
You should first find out the actual encoding of your CSV file. You can do this by opening the file in Notepad++ and checking which option is selected under the Encoding menu. Other text editors, such as Vim and UltraEdit, have similar features.
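If no such editor is at hand, a rough check can also be done in Ruby. The sketch below is not a real detector: plausible_encodings is a hypothetical helper and the candidate list is an assumption. It only reports which candidates can decode the bytes without error, and single-byte encodings such as ISO-8859-1 accept any byte, so a match is a hint rather than proof.

```ruby
# Hypothetical helper: list the candidate encodings under which the file's
# bytes form a valid string. The default candidate list is an assumption;
# adjust it to the encodings you actually suspect.
def plausible_encodings(path, candidates = %w[UTF-8 GBK ISO-8859-1])
  bytes = File.binread(path)  # raw bytes, no decoding yet
  candidates.select do |name|
    # Re-tag a copy of the bytes and ask Ruby whether they are well-formed.
    bytes.dup.force_encoding(name).valid_encoding?
  end
end
```

If 'UTF-8' is missing from the result, that is consistent with the invalid byte sequence error in the question.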
Suppose you find the actual encoding of the CSV file is GBK; rewrite your code as:
CSV.foreach('myfile.csv', headers: true, encoding: 'GBK') do |row|
...
end
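If the encoding cannot be determined and, as the question says, the degree symbol itself does not matter, another option is to drop whatever bytes are not valid UTF-8 before parsing. This is a minimal sketch under that assumption; parse_csv_dropping_invalid_bytes is a hypothetical helper name, and any other non-ASCII characters in the file would be dropped as well.

```ruby
require 'csv'

# Hypothetical helper: parse a CSV file after discarding bytes that are not
# valid UTF-8. Assumption: losing those characters (e.g. the degree symbol)
# is acceptable.
def parse_csv_dropping_invalid_bytes(path)
  raw = File.binread(path)                      # raw bytes, no decoding
  text = raw.force_encoding('UTF-8').scrub('')  # remove invalid byte sequences
  CSV.parse(text, headers: true)                # returns a CSV::Table
end
```

With a header stored as 'Tx1 °C' in a single-byte encoding, the parsed header should come back as 'Tx1 C'.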
Upvotes: 1