hhanzo1

Reputation: 657

File.readlines invalid byte sequence in UTF-8 (ArgumentError)

I am processing a file that contains data from the web, and on certain log files I get an "invalid byte sequence in UTF-8 (ArgumentError)" error.

require 'csv'

a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

I am trying to get this solution working. I have seen people doing

.encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't appear to work with File.readlines.

File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)

undefined method `encode!' for #<Array:0x...> (NoMethodError)

What's the most straightforward way to filter or convert invalid UTF-8 characters while reading a file?

Attempt 1

I tried this, but it failed with the same invalid byte sequence error.

IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv;
  { timestamp: s[0], url: s[1], ip: s[3] }
end

Solution

This seems to have worked for me.

a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a

Does Ruby provide a way to do File.read() with specified encoding?

Upvotes: 6

Views: 9914

Answers (2)

markhorrocks

Reputation: 1538

In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.

I did

File.readlines(email, :encoding => 'UTF-8').each do |line|

but this didn't work with some Japanese characters, so I added the following on the next line, and that worked fine.

line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

Upvotes: 1

7stud

Reputation: 48649

I am trying to get this solution working. I have seen people doing

   .encode!('UTF-8', 'UTF-8', :invalid => :replace)

but it doesn't appear to work with File.readlines.

File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
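
Since the cleanup has to happen per string, one option is to map over the array that File.readlines returns. This is a minimal sketch, assuming Ruby 2.1 or later for String#scrub; the helper name read_clean_lines and the sample data are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper (not from the thread): read all lines as UTF-8 and
# replace any invalid byte sequence in each line with `replacement`.
# String#scrub requires Ruby 2.1 or later.
def read_clean_lines(path, replacement = '?')
  File.readlines(path, encoding: 'UTF-8').map { |line| line.scrub(replacement) }
end

# Build a small sample file containing one stray 0xFF byte (made-up data).
sample = Tempfile.new(['log', '.csv'])
sample.binmode
sample.write("t1,http://example.com/watch?v=abc\xFF,x,1.2.3.4\n".b)
sample.write("t2,http://example.com/other,x,5.6.7.8\n".b)
sample.close

clean_lines = read_clean_lines(sample.path)
matches = clean_lines.grep(/watch\?v=/) # grep no longer hits invalid bytes
puts matches
```

The scrubbed lines can then go through grep and parse_csv as in the question.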

Could you please provide an example of the alternative above?

require 'csv'

CSV.foreach("log.csv", encoding: "utf-8") do |row|
  # row[1] is the URL column in the question's layout (timestamp, url, ..., ip)
  puts row[0], row[1], row[3] if row[1] =~ /watch\?v=/
end

Or,

CSV.foreach("log.csv", 'rb:utf-8') do |row|

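If you are on Ruby 1.8 and need more speed, use the fastercsv gem. Since Ruby 1.9 the standard csv library is FasterCSV, so the gem adds nothing there.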

This seems to have worked for me.

File.readlines('log.csv', :encoding => 'ISO-8859-1')

Yes, in order to read a file you have to know its encoding.
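
File.read accepts the same :encoding option as File.readlines. The sketch below assumes the file really is ISO-8859-1 and uses the "external:internal" form to transcode to UTF-8 while reading; the sample file and its contents are made up for illustration.

```ruby
require 'tempfile'

# Made-up sample: "café" encoded as ISO-8859-1, where byte 0xE9 is "é".
sample = Tempfile.new('latin1')
sample.binmode
sample.write("caf\xE9\n".b)
sample.close

# "external:internal" reads the file as ISO-8859-1 and transcodes to UTF-8.
text  = File.read(sample.path, encoding: 'ISO-8859-1:UTF-8')
lines = File.readlines(sample.path, encoding: 'ISO-8859-1:UTF-8')

puts text.encoding
puts lines.first.valid_encoding?
```

Note that ISO-8859-1 accepts every byte value, so reading with it never raises, but the characters will be wrong if the file is actually in some other encoding.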

Upvotes: 7
