Jignesh Gohel

Reputation: 6552

Ruby CSV library read from file: Determining source file encoding to be provided to foreach method

I was facing a problem regarding parsing a CSV file posted here: Ruby unable to parse a CSV file: CSV::MalformedCSVError (Illegal quoting in line 1.)

The problem was resolved with guidance from @Anand. I am posting the answer here for reference in case it helps others.

My file (/tmp/my_data.csv), used in the sample code shown in the post referenced above, actually contained a Unicode BOM (Byte Order Mark) at the start of the file. I received the file from a client, so I don't know how it got in there.

Using @Anand's suggestion worked. In the case of a string:

sub!(/^\xEF\xBB\xBF/, '')

or, in the case of a file starting with the BOM bytes:

CSV.foreach("test.csv", encoding: "bom|utf-8")
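
For reference, a minimal self-contained sketch combining both approaches (using the /tmp/my_data.csv path from my case; note that \uFEFF is the BOM code point, equivalent to the \xEF\xBB\xBF byte sequence but written so the regexp stays UTF-8 compatible):

require 'csv'

# Approach 1: strip the BOM from the string contents before parsing.
raw = File.read("/tmp/my_data.csv", encoding: "utf-8")
raw.sub!(/\A\uFEFF/, "")
rows = CSV.parse(raw)

# Approach 2: let Ruby detect and strip the BOM while opening the file.
CSV.foreach("/tmp/my_data.csv", encoding: "bom|utf-8") do |row|
  p row
end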

However, this raises another question: is there a way to detect problems in a CSV file, such as the presence of specially encoded characters like a BOM, before I start reading it? The solution shown above required knowing the source file's encoding as well as the target encoding, so there should be a way to determine the source encoding, or at least to detect that such characters are present in the file to be read. If anybody has any idea on this, please provide your inputs.

Thanks, Jignesh

Upvotes: 2

Views: 1104

Answers (1)

Patrick Oscity

Reputation: 54684

Using 'bom|utf-8', the BOM will be removed if it is present. It doesn't matter whether the file actually has a BOM or not, so you're safe with this option. From the Ruby documentation:

If ext_enc starts with 'BOM|', check whether the input has a BOM. If there is a BOM, strip it and set external encoding as what the BOM tells.
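
A small illustration of that behaviour (the file names and contents here are made up for the example): reading with "bom|utf-8" gives the same header row whether or not the file starts with a BOM, while a plain UTF-8 read leaves the BOM attached to the first field.

require 'csv'

# Two hypothetical sample files: one written with a BOM, one without.
File.write("with_bom.csv",    "\uFEFFname,age\nalice,30\n")
File.write("without_bom.csv", "name,age\nalice,30\n")

# With "bom|utf-8" the BOM is stripped if present and ignored if absent,
# so both files parse to the same header row.
p CSV.read("with_bom.csv",    encoding: "bom|utf-8").first  # => ["name", "age"]
p CSV.read("without_bom.csv", encoding: "bom|utf-8").first  # => ["name", "age"]

# With plain "utf-8" the BOM survives and sticks to the first field.
p CSV.read("with_bom.csv",    encoding: "utf-8").first      # => ["\uFEFFname", "age"]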

Upvotes: 2
