Reputation: 657
I am processing a file that contains data from the web, and I get an "invalid byte sequence in UTF-8 (ArgumentError)" error on certain log files.
require 'csv' # needed for String#parse_csv
a = File.readlines('log.csv').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines:
File.readlines('log.csv').encode!('UTF-8', 'UTF-8', :invalid => :replace).grep(/watch\?v=/)
undefined method `encode!' for #<Array:0x...> (NoMethodError)
What's the most straightforward way to filter or convert invalid UTF-8 characters during a file read?
Attempt 1
I tried this, but it failed with the same invalid byte sequence error.
IO.foreach('test.csv', 'r:bom|UTF-8').grep(/watch\?v=/).map do |s|
  # extract three columns: time stamp, url, ip
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
Solution
This seems to have worked for me.
a = File.readlines('log.csv', :encoding => 'ISO-8859-1').grep(/watch\?v=/).map do |s|
  s = s.parse_csv
  { timestamp: s[0], url: s[1], ip: s[3] }
end
puts a
Does Ruby provide a way to do File.read() with a specified encoding?
Upvotes: 6
Views: 9914
Reputation: 1538
In my case the script defaulted to US-ASCII and I wasn't at liberty to change it on the server for risk of other conflicts.
I did
File.readlines(email, :encoding => 'UTF-8').each do |line|
but this didn't work with some Japanese characters, so I added this on the next line, and that worked fine.
line = line.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
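One thing worth knowing about that line: when the source encoding is 'binary', every byte above 0x7F is undefined in the conversion, so valid multibyte characters are dropped along with the bad bytes. Below is a rough sketch comparing it with String#scrub, assuming Ruby 2.1 or later; the helper names and the sample string are made up for illustration.

```ruby
# Hypothetical helper: the binary-to-UTF-8 conversion from the line above.
# It removes every non-ASCII byte, whether or not it was part of a valid character.
def strip_non_ascii(line)
  line.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Hypothetical helper: String#scrub removes only the byte sequences that are
# invalid in the string's own encoding and keeps valid characters.
def drop_invalid_bytes(line)
  line.scrub('')
end

# Made-up sample: valid Japanese text followed by one stray byte (\xFF).
bad_byte = "\xFF".dup.force_encoding(Encoding::UTF_8)
line = "日本語 ok" + bad_byte

puts strip_non_ascii(line).inspect
puts drop_invalid_bytes(line).inspect
```

If the goal is to keep the Japanese text and discard only the broken bytes, the second helper is probably closer to what is wanted.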
Upvotes: 1
Reputation: 48649
I am trying to get this solution working. I have seen people doing
.encode!('UTF-8', 'UTF-8', :invalid => :replace)
but it doesn't appear to work with File.readlines.
File.readlines returns an Array. Arrays don't have an encode method. On the other hand, strings do have an encode method.
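To illustrate, here is a minimal sketch that applies a String method to each element of the array instead of to the array itself. It assumes Ruby 2.1 or later for String#scrub; the scrub_lines helper and the sample lines are hypothetical.

```ruby
# Hypothetical helper: returns a copy of each line with invalid bytes replaced by "?".
def scrub_lines(lines)
  # Array has no encode!/scrub, so call the String method on each element.
  lines.map { |line| line.scrub('?') }
end

# Made-up sample data; the second line contains \xFF, which is never valid in UTF-8.
sample = [
  "2013-01-01,http://example.com/watch?v=abc,x,1.2.3.4\n",
  "2013-01-02,http://example.com/watch?v=\xFFdef,x,5.6.7.8\n"
].map { |s| s.dup.force_encoding(Encoding::UTF_8) }

# After scrubbing, the regexp match no longer sees an invalid byte sequence.
puts scrub_lines(sample).grep(/watch\?v=/)
```

With a real file, the same helper could be applied to the result of File.readlines before calling grep.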
Could you please provide an example of the alternative above?
require 'csv'
CSV.foreach("log.csv", encoding: "utf-8") do |row|
  # Per the question's column layout, the URL is in the second column (row[1]).
  puts row[0], row[1], row[3] if row[1].to_s.match?(/watch\?v=/)
end
Or,
CSV.foreach("log.csv", 'rb:utf-8') do |row|
If you need more speed on Ruby 1.8, use the fastercsv gem. From Ruby 1.9 onward, the standard csv library is already FasterCSV.
This seems to have worked for me.
File.readlines('log.csv', :encoding => 'ISO-8859-1')
Yes, in order to read a file you have to know its encoding.
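As a rough sketch of what that means in practice: ISO-8859-1 assigns a character to every byte, so reading with it never raises, and an "external:internal" encoding pair transcodes the lines to UTF-8 as they are read. The helper name and the sample file below are assumptions for illustration, and the result is only meaningful if the file really is ISO-8859-1.

```ruby
require 'tempfile'

# Hypothetical helper: read a file known to be ISO-8859-1 and return UTF-8 lines.
def read_latin1_as_utf8(path)
  File.readlines(path, encoding: 'ISO-8859-1:UTF-8')
end

# Made-up sample: "café" written as ISO-8859-1, where é is the single byte 0xE9.
file = Tempfile.new(['log', '.csv'])
file.binmode
file.write("caf\xE9,watch?v=1\n".b)
file.close

# Read as UTF-8, the lone 0xE9 byte makes the line invalid.
as_utf8 = File.readlines(file.path, encoding: 'UTF-8')
puts as_utf8.first.valid_encoding?

# Read as ISO-8859-1 and transcoded, the line is a valid UTF-8 string.
as_latin1 = read_latin1_as_utf8(file.path)
puts as_latin1.first.valid_encoding?

file.unlink
```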
Upvotes: 7