Reputation: 13
ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file, and output the scraped content and its URL to a new file (myResults.csv). However, I keep getting a 'invalid byte sequence in UTF-8 (ArgumentError)' error, which is suggesting the URLs are not valid? (they are all standard 'http://www.exmaple.com/page' and work in my browser)?
Have tried .parse and .encode from similar threads on here, but no luck. Thanks for reading.
The code:
require 'csv'
require 'nokogiri'
require 'open-uri'
CSV_OPTIONS = {
:write_headers => true,
:headers => %w[url desc]
}
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls') do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
end
puts "scraping done!"
The error message:
/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
from bbb.rb:13:in `block (2 levels) in <main>'
from bbb.rb:11:in `foreach'
from bbb.rb:11:in `block in <main>'
from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from bbb.rb:10:in `<main>'
Upvotes: 0
Views: 1589
Reputation: 196
I'm a bit late to the party here, but this should work for anyone running into the same issue in the future: csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
Upvotes: 1
Reputation: 1171
Two things:
You say that the URLs are stored in a CSV file but you reference an Excel-file in your code listOfURLs.xls
The issue seems to be the encoding of the file listOfURLs.xls
, ruby assumes that the file is UTF-8 encoded. If the file is not UTF-8 encoded or contains non valid UTF-8 characters you can get that error.
You should double check that the file is encoded in UTF-8 and doesn't contain any illegal characters.
If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:
f = File.foreach('listOfURLs.xls', {encoding: "iso-8859-1"}) do |row|
puts row
end
Some good info about invalid byte sequences in UTF-8
Update:
An example:
CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
csv_doc = File.foreach('listOfURLs.xls', {encoding: "iso-8859-1"}) do |url|
URI.parse(URI.encode(url.chomp))
begin
page = Nokogiri.HTML(open(url))
page.css('.bio media-content').each do |scrape|
desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
csv << [url, desc]
end
end
end
Upvotes: 2