JammyStressford

Reputation: 13

Ruby/Nokogiri site scraping - invalid byte sequence in UTF-8 (ArgumentError)

Ruby n00b here. I'm trying to scrape one p tag from each of the URLs stored in a CSV file and write the scraped content and its URL to a new file (myResults.csv). However, I keep getting an 'invalid byte sequence in UTF-8 (ArgumentError)' error. Does that mean the URLs are not valid? They are all standard 'http://www.example.com/page' URLs and work in my browser.

I have tried .parse and .encode, as suggested in similar threads on here, but no luck. Thanks for reading.

The code:

require 'csv'
require 'nokogiri'
require 'open-uri'

CSV_OPTIONS = {
  :write_headers => true,
  :headers => %w[url desc]
}

CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
  csv_doc = File.foreach('listOfURLs.xls') do |url|
    URI.parse(URI.encode(url.chomp))
    begin
    page = Nokogiri.HTML(open(url))
      page.css('.bio media-content').each do |scrape|
      desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace) 
      csv << [url, desc]

    end
  end
end
end

puts "scraping done!"

The error message:

/Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:304:in `escape'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/uri/common.rb:623:in `escape'
    from bbb.rb:13:in `block (2 levels) in <main>'
    from bbb.rb:11:in `foreach'
    from bbb.rb:11:in `block in <main>'
    from /Users/oli/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from bbb.rb:10:in `<main>'

Upvotes: 0

Views: 1589

Answers (2)

SickLickWill

Reputation: 196

I'm a bit late to the party here, but this should work for anyone running into the same issue in the future:

csv_doc = IO.read(file).force_encoding('ISO-8859-1').encode('utf-8', replace: nil)
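As a minimal sketch of what a one-liner like IO.read(file).force_encoding('ISO-8859-1').encode('utf-8') does, the snippet below uses an in-memory byte string in place of the file contents. The sample URL and the helper name `to_utf8` are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper: treat raw bytes as ISO-8859-1 and transcode them to UTF-8.
def to_utf8(raw)
  # dup so the caller's string is not relabelled in place
  raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# 0xE9 is "é" in ISO-8859-1, but a lone 0xE9 byte is not valid UTF-8.
raw = "http://www.example.com/caf\xE9".b   # .b returns a binary (ASCII-8BIT) copy

# Labelling the same bytes as UTF-8 shows why Ruby complains.
as_utf8_label = raw.dup.force_encoding('UTF-8')
puts as_utf8_label.valid_encoding?   # => false

# After transcoding, the string is valid UTF-8.
converted = to_utf8(raw)
puts converted.valid_encoding?       # => true
puts converted
```

This only gives sensible text if the file really is ISO-8859-1 (or close to it); with a different source encoding the bytes would still convert, but to the wrong characters.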

Upvotes: 1

Jacob Rastad

Reputation: 1171

Two things:

  1. You say that the URLs are stored in a CSV file, but your code references an Excel file, listOfURLs.xls.

  2. The issue seems to be the encoding of the file listOfURLs.xls. Ruby assumes that the file is UTF-8 encoded; if it is not, or if it contains invalid UTF-8 byte sequences, you can get that error.

    You should double-check that the file is encoded in UTF-8 and doesn't contain any illegal characters.

    If you must open a file that is not UTF-8 encoded, try this for ISO-8859-1:

    # Tell Ruby to read the file as ISO-8859-1 instead of the default UTF-8
    File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |row|
        puts row
    end
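To check which lines of the input actually trigger the error, one option is to read the file in binary mode and test each line with `String#valid_encoding?`. The sketch below writes a small temporary file so that it is self-contained; the file contents and the helper name `invalid_utf8_lines` are made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: return [line_number, raw_line] pairs that are not valid UTF-8.
def invalid_utf8_lines(path)
  bad = []
  # Binary mode means no decoding happens while reading
  File.foreach(path, mode: 'rb').with_index(1) do |line, lineno|
    candidate = line.dup.force_encoding('UTF-8')
    bad << [lineno, line] unless candidate.valid_encoding?
  end
  bad
end

# Demo data: the second line contains a lone 0xE9 byte (ISO-8859-1 "é").
sample = Tempfile.new('urls')
sample.binmode
sample.write("http://www.example.com/page\n".b)
sample.write("http://www.example.com/caf\xE9\n".b)
sample.close

bad_lines = invalid_utf8_lines(sample.path)
bad_lines.each { |lineno, line| puts "line #{lineno}: #{line.inspect}" }
sample.unlink
```

If this reports nearly every line, the file is probably not a plain text file at all (for example, a real binary .xls workbook), and re-exporting it as CSV is likely the simpler fix.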
    

Some good info about invalid byte sequences in UTF-8

Update:

An example:

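CSV.open('myResults.csv', 'wb', CSV_OPTIONS) do |csv|
    # Read the URL list as ISO-8859-1 so non-UTF-8 bytes no longer raise
    File.foreach('listOfURLs.xls', encoding: 'iso-8859-1') do |line|
        # Strip the trailing newline and escape the URL
        # (URI.encode exists on Ruby 2.x; it was removed in Ruby 3.0)
        url = URI.encode(line.chomp)
        begin
            page = Nokogiri.HTML(open(url))
            page.css('.bio media-content').each do |scrape|
                desc = scrape.at_css('p').text.encode!('UTF-8', 'UTF-8', :invalid => :replace)
                csv << [url, desc]
            end
        rescue OpenURI::HTTPError => e
            # Skip pages that fail to load instead of aborting the whole run
            warn "Skipping #{url}: #{e.message}"
        end
    end
end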
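
One caveat on the encode!('UTF-8', 'UTF-8', :invalid => :replace) call: as far as I know, on Ruby 2.0 (the version in the stack trace) transcoding a string to the encoding it already has is a no-op, so invalid bytes are left in place. On Ruby 2.1 or later, `String#scrub` handles that case directly. A minimal sketch, with a made-up sample string:

```ruby
# Sample text with one stray byte that is not valid UTF-8 (made up for illustration).
dirty = "Bio: caf\xE9 owner"

# scrub replaces each invalid byte sequence with the given replacement string.
clean = dirty.scrub('?')

puts dirty.valid_encoding?  # => false
puts clean.valid_encoding?  # => true
puts clean                  # => Bio: caf? owner
```

Scrubbing discards the offending bytes, so it suits cases where the real source encoding is unknown; when the encoding is known, transcoding from it keeps the original characters.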

Upvotes: 2
