rigyt

Reputation: 2353

Can't read a UTF-16LE file in Ruby; only puts shows legible values

I need to read an external file in Ruby. Running file -i on it locally reports text/plain; charset=utf-16le

I open it with Ruby's CSV using the separator '\t', and a row shows as: <CSV::Row "\xFF\xFEC\x00a\x00n\x00d\x00i\x00d\x00a\x00t\x00e\x00 \x00n\x00u\...

row.to_s produces \x000\x000\x000\x001\x00\t\x00E\x00D\x00O

Running puts row shows the data correctly: 0001 EDOARDO A... (the values also show legibly in vim and LibreOffice Calc)

Any suggestions on how to get the data in Ruby? I've tried various combinations of opening the CSV with external_encoding: 'utf-16le', internal_encoding: "utf-8", etc., but puts is the only thing that gives legible values.

The CSV object also reports its encoding as ASCII-8BIT: <#CSV io_type:StringIO encoding:ASCII-8BIT lineno:0 col_sep:"\\t" row_sep:"\n" quote_char:"\"" headers:true>

The file itself was produced as an XLS file. I have uploaded an edited version here (edited in gvim)

Upvotes: 0

Views: 277

Answers (2)

rigyt

Reputation: 2353

The issue was that I was reading from a Paperclip attachment, which needed to have the encoding set (overridden) before saving.

Adding s3_headers in the model worked:

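 has_attached_file :attachment,
                   # Override the Content-Type header stored with the S3 object
                   # so that the charset is declared explicitly.
                   s3_headers: lambda { |attachment|
                     { 'content-Type' => 'text/csv; charset=utf-16le' }
                   }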

Thanks to Julien for tipping me off that the issue was related to the Paperclip attachment (his solution works when reading the file directly).
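For cases where the bytes are already in memory as a binary (ASCII-8BIT) string, as with the StringIO shown in the question, a minimal sketch of an alternative is to re-tag the bytes as UTF-16LE and transcode them before parsing. The sample data below is made up for illustration, and how the bytes are obtained from the attachment is not shown and is left as an assumption. inspect escapes each byte because the string is tagged as binary, which is why the \x00 padding shows up there.

```ruby
require 'csv'

# Hypothetical sample standing in for the attachment's content:
# a BOM followed by tab-separated text, encoded as UTF-16LE and
# tagged as binary (ASCII-8BIT), the way it arrives in the question.
raw = "\uFEFFCandidate number\tFirst name\n0001\tEDOARDO\n".encode('UTF-16LE').b

# force_encoding only changes the encoding tag; encode then converts
# the bytes to UTF-8. The leading BOM survives transcoding, so strip it.
text = raw.dup
          .force_encoding('UTF-16LE')
          .encode('UTF-8')
          .delete_prefix("\uFEFF")

rows  = CSV.parse(text, col_sep: "\t", headers: true)
first = rows.first

puts first['First name']
```

Stripping the BOM matters here because otherwise it would become part of the first header name, and a lookup such as first['Candidate number'] would not match.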

Upvotes: 0

Julien

Reputation: 2319

This works fine for me:

require 'csv'

CSV.foreach("file.xls", encoding: "UTF-16LE:UTF-8", col_sep: "\t") do |row|
  puts row.inspect
end

This produces the following output:

["Candidate number", "First name", "Last name", "Date of birth", "Preparation centre", "Result", "Score", "Reading and Writing", "Listening", "Speaking", "Result enquiry", "Raised on", "Raised by", "Enquiry status", "Withdrawn on", "Withdrawn by", nil]
["0001", "EDOARDO", "AGNEW", "20/01/2001", "Fondazione Istituto Massimo", "RY5-G8-Y2", "-", nil, nil, nil, "-", "00000000", nil, nil, "00000000", nil, nil]

As you can see, each row is an array with one entry per column in the document: a string for each populated cell and nil for each empty one.
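The encoding string "UTF-16LE:UTF-8" has the form external:internal: the file is decoded as UTF-16LE and the strings handed to the block are transcoded to UTF-8. A self-contained sketch that reproduces this with a made-up two-line file (the sample content and the temporary path are assumptions for illustration, not the real data) could look like this. Since the real file starts with a byte order mark (\xFF\xFE), the sketch also strips it from the first field in case it survives decoding.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical sample content: BOM, a header line and one data line.
sample = "\uFEFFCandidate number\tFirst name\n0001\tEDOARDO\n"

rows = []
Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample.xls')

  # Write the raw UTF-16LE bytes, mimicking the exported file.
  File.binwrite(path, sample.encode('UTF-16LE'))

  # External encoding UTF-16LE, internal encoding UTF-8.
  CSV.foreach(path, encoding: 'UTF-16LE:UTF-8', col_sep: "\t") do |row|
    rows << row
  end
end

# Remove a leading BOM from the first header if one is present.
first_header = rows[0][0].delete_prefix("\uFEFF")

puts first_header
puts rows[1].inspect
```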

Upvotes: 1
