Doolan

Reputation: 1641

Invalid Byte Sequence in UTF-8 from Excel file

(Ruby 2.5) I have a method that reads and parses a CSV file that is uploaded via Alchemy CMS:

def process_csv(csv_file, current_user_id, original_filename)
    lock_importer

    errors           = []
    index            = 0
    string_converter = lambda { |field| field.strip }
    total            = CSV.foreach(csv_file, headers: true).count
    csv_string = csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)

    CSV.parse(csv_string, headers: true, header_converters: :symbol, skip_blanks: true, converters: [string_converter]) do |row|
        # do other stuff
    end
end

However, when I upload a CSV file whose name column contains a string with special characters, I get an Invalid Byte Sequence in UTF-8 error. I'm testing with the value N'öt Réal Stô'rë.

I've tried a few solutions that I found on the web, but no luck. Any suggestions?

Upvotes: 0

Views: 1300

Answers (1)

knut

Reputation: 27885

It's unclear what your csv_file is. I guess it is a File object.

I have sometimes received CSV exports from Excel encoded as UTF-16, so let's try an example.

I have a CSV file stored as UTF-16BE with the following content:

line;comment;UmlautÄ
1;Das ist UTF-16 BE;Ä
2;öüäÖÄÜ;Ä

If I execute the following code:

require 'csv'
def process_csv(csv_file)
    csv_string = csv_file.read#.encode!("UTF-8", "iso-8859-1", invalid: :replace)
    CSV.parse(csv_string, headers: true, skip_blanks: true, col_sep: ';') do |row|
      p row # do other stuff
    end
end

process_csv(File.open('example_utf16BE.txt'))

then I also get an Invalid byte sequence in UTF-8 error.

If I use

process_csv(File.open('example_utf16BE.txt', 'rb', encoding: 'BOM|utf-16BE'))

then everything works.
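
If the rest of the importer expects UTF-8, the string can be transcoded after reading. This is only a minimal sketch: it assumes the upload really is UTF-16BE (with or without a BOM) and uses ; as separator, and the method name read_utf16be_csv is made up for illustration.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: read a UTF-16BE file (BOM optional) and return its rows
# as an array of hashes containing UTF-8 strings.
def read_utf16be_csv(path)
  utf16_text = File.open(path, 'rb', encoding: 'BOM|utf-16BE', &:read)
  utf8_text  = utf16_text.encode('UTF-8') # source encoding is taken from the string itself
  CSV.parse(utf8_text, headers: true, skip_blanks: true, col_sep: ';').map(&:to_h)
end

# Usage: write the sample from above as UTF-16BE with a BOM, then read it back.
Dir.mktmpdir do |dir|
  path   = File.join(dir, 'example_utf16BE.txt')
  sample = "\uFEFFline;comment;UmlautÄ\n1;Das ist UTF-16 BE;Ä\n2;öüäÖÄÜ;Ä\n"
  File.binwrite(path, sample.encode('UTF-16BE'))
  read_utf16be_csv(path).each { |row| p row }
end
```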

So I guess you are getting a File object in the wrong encoding, and the call csv_file.read.encode!("UTF-8", "iso-8859-1", invalid: :replace) is an attempt to repair this problem.
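
Such an encode! call cannot repair a UTF-16 upload, though: encode(dst, src) reinterprets the bytes as src, and every byte is a valid ISO-8859-1 character, so nothing is ever flagged as invalid. A small sketch of that effect (the helper name latin1_to_utf8 is made up, and it uses the non-destructive encode):

```ruby
# Hypothetical helper mirroring the call from the question.
def latin1_to_utf8(bytes)
  bytes.encode('UTF-8', 'ISO-8859-1', invalid: :replace)
end

latin1_bytes = 'Ä'.encode('ISO-8859-1').b # one byte:  C4
utf16_bytes  = 'Ä'.encode('UTF-16BE').b   # two bytes: 00 C4

p latin1_to_utf8(latin1_bytes) # bytes really were ISO-8859-1: "Ä" comes back
p latin1_to_utf8(utf16_bytes)  # UTF-16BE misread as ISO-8859-1: a NUL character, then "Ä"
```

The second result is valid UTF-8, but it is not the original text, so later parsing steps see stray NUL characters instead of an encoding error.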

What you can do:

Add this to your code:

    p csv_file
    p csv_file.external_encoding

You should get

#<File:example_utf16BE.txt>
#<Encoding:UTF-16BE>

Now check whether the file (in my example, example_utf16BE.txt) really has the encoding reported in the second line.
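
One rough way to do that check is to look at the raw bytes. The following is a sketch that assumes only UTF-8 and UTF-16 need to be told apart; guess_encoding and BOMS are made-up names. A nil result only means "no known BOM and not valid UTF-8"; a legacy encoding such as ISO-8859-1 cannot be identified this way.

```ruby
require 'tmpdir'

# Byte-order marks for the encodings this sketch knows about.
# (UTF-32 is ignored; its little-endian BOM also starts with FF FE.)
BOMS = {
  "\xEF\xBB\xBF".b => Encoding::UTF_8,
  "\xFE\xFF".b     => Encoding::UTF_16BE,
  "\xFF\xFE".b     => Encoding::UTF_16LE
}.freeze

# Hypothetical helper: guess a file's encoding from its BOM, falling back to
# a UTF-8 validity check.
def guess_encoding(path)
  raw = File.binread(path)
  BOMS.each { |bom, enc| return enc if raw.start_with?(bom) }
  # No BOM: accept UTF-8 if the bytes form valid UTF-8, otherwise give up.
  raw.force_encoding(Encoding::UTF_8).valid_encoding? ? Encoding::UTF_8 : nil
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'example_utf16BE.txt')
  File.binwrite(path, "\uFEFFline;comment\n".encode('UTF-16BE'))
  p guess_encoding(path) # compare this with csv_file.external_encoding
end
```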

If not, try to adapt how the File object is created. If that is not possible, you can try csv_file.set_encoding 'utf-8' to change the encoding before you read the content.
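
A sketch of that last fallback, assuming purely for illustration that the upload is really ISO-8859-1 (as the encode! call in the question suggests) and that csv_file is an already opened File. The name read_as_utf8 is made up. Passing a second argument to set_encoding makes read transcode to UTF-8:

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: declare the real encoding of an already opened IO and
# let #read transcode the content to UTF-8.
def read_as_utf8(io, source_encoding)
  io.set_encoding(source_encoding, 'UTF-8')
  io.read
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'stores.csv')
  File.binwrite(path, "name\nN'öt Réal Stô'rë\n".encode('ISO-8859-1'))

  File.open(path) do |csv_file| # opened without any encoding hint
    csv_string = read_as_utf8(csv_file, 'ISO-8859-1')
    CSV.parse(csv_string, headers: true) { |row| p row['name'] }
  end
end
```

This only helps if the declared source encoding matches the actual bytes, so it is worth checking the file first.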

Upvotes: 0
