Sig

Reputation: 5952

Right encoding for CSV with degree symbol

I need to parse a CSV file that contains the degree symbol (°) inside a header. If I try to open the file:

require 'csv'

CSV.foreach('myfile.csv', headers: true) do |row|
  ...
end

I get invalid byte sequence in UTF-8 (ArgumentError). So I tried a few other encodings (ISO-8859-1 and ASCII-8BIT), but I always get a CSV::MalformedCSVError.

Which encoding should I specify in order to be able to read the file?

Actually, I don't care about the degree symbol, so a solution that simply ignores it (and returns, for instance, 'Tx1 C' instead of 'Tx1 °C') would also work for me.

Upvotes: 1

Views: 1871

Answers (2)

Mark Setchell

Reputation: 207560

You could shell out to a process that removes the little devils before you open the file:

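system("LANG=C tr -d '\\260' < myfile.csv > $$.tmp && mv $$.tmp myfile.csv")

The tr -d '\260' deletes every byte with octal value 260 (0xB0, the degree sign in ISO-8859-1) and saves the result to a file named after the shell's process ID ($$) with the extension .tmp. If that succeeds (&&), the temporary file replaces the original. The backslash is doubled inside the Ruby string so that the shell, not Ruby, sees the \260 escape, and the redirect uses > rather than >> so that a leftover temporary file is overwritten instead of appended to.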

You can try the tr command on its own at the shell to test it like this:

LANG=C tr -d '\260' < myfile.csv

If you target Windows, where tr is not available by default, you may have to do something like this instead, which drops the first line (the header) altogether:

more +1 unhappy.csv > happy.csv

Note that more has a limit of 65535 lines though.

Upvotes: 0

Arie Xiao

Reputation: 14082

The default encoding for reading external files (Encoding.default_external) is typically UTF-8. However, your CSV file isn't stored in UTF-8. When Ruby tries to parse a non-UTF-8 byte sequence as UTF-8, it raises an error as soon as it hits bytes that are not valid in that encoding.

You should first find out the actual encoding of your CSV file. You can do this by opening the file in Notepad++ and checking which option is selected under the Encoding menu. Other text editors, such as Vim and UltraEdit, offer a similar feature.
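If no such editor is at hand, you can also look at the raw bytes from Ruby itself. The sketch below is only a rough aid, not a real encoding detector, and non_ascii_bytes and valid_utf8? are hypothetical helper names. The idea is that a degree sign stored as the single byte 0xB0 points to ISO-8859-1 or Windows-1252, while the pair 0xC2 0xB0 is how UTF-8 encodes it.

```ruby
# Hypothetical helper: the distinct non-ASCII bytes in `data`, formatted as hex,
# in order of first appearance.
def non_ascii_bytes(data)
  data.b.bytes.select { |b| b > 0x7F }.uniq.map { |b| format('0x%02X', b) }
end

# Hypothetical helper: true if the bytes form a valid UTF-8 sequence.
def valid_utf8?(data)
  # dup so the caller's string keeps its original encoding tag.
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage:
#   raw = File.binread('myfile.csv')
#   puts valid_utf8?(raw)
#   puts non_ascii_bytes(raw).inspect
```

Keep in mind that every byte sequence is valid ISO-8859-1, so a validity check alone cannot confirm that encoding; the byte values are the more useful hint.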

Suppose you find that the actual encoding of the CSV file is GBK; then rewrite your code as:

CSV.foreach('myfile.csv', headers: true, encoding: 'GBK') do |row|
  ...
end
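Since the question says the degree symbol itself doesn't matter, here is a hedged sketch that combines both ideas: it reads the file in its source encoding, transcodes to UTF-8, and drops the symbol from the headers. The method name read_rows is hypothetical, and ISO-8859-1 is only an assumed default; substitute whatever encoding you actually find. The "SOURCE:UTF-8" form of the encoding option asks CSV to transcode while reading.

```ruby
require 'csv'

# Hypothetical helper: parses a CSV file stored in `source_encoding` and returns
# its rows as hashes, with the degree sign removed from every header.
# Assumption: ISO-8859-1 is a guess; pass the encoding you actually detected.
def read_rows(path, source_encoding = 'ISO-8859-1')
  rows = []
  CSV.foreach(
    path,
    headers: true,
    # Read as source_encoding, transcode to UTF-8 internally.
    encoding: "#{source_encoding}:UTF-8",
    # Strip the degree sign (U+00B0) from each header; nil headers pass through.
    header_converters: ->(h) { h&.delete("\u00B0") }
  ) do |row|
    rows << row.to_h
  end
  rows
end

# Usage:
#   read_rows('myfile.csv').each { |row| puts row['Tx1 C'] }
```

If parsing still fails with CSV::MalformedCSVError after the encoding is right, the cause is probably something other than the degree symbol, such as stray quotes or unusual line endings.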

Upvotes: 1
