AtulBha

Reputation: 41

Ruby extracting links from html

Hello, here is my script:

require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  # Collapse whitespace runs in the title; strip quotes and newlines from the body text
  title = doc.title.gsub(/\s+/, " ").downcase.strip
  body = doc.xpath('//body').inner_text.tr('"', '').gsub("\n", '').downcase.strip
  link = doc.search("a[@href]") # Adding this part generates errors
  filename = File.basename(input_filename, ".*")
  puts %Q("#{title}", "#{body}", "#{filename}", "#{link}").downcase
end

I am having trouble extracting links from a list of HTML files. I believe the issue is due to unconventional coding in some of the HTML files. Here is the error I am getting:

extractor.rb:9:in `block in <main>': incompatible character encodings: UTF-8 and CP850 (Encoding::CompatibilityError)
        from extractor.rb:4:in `each'
        from extractor.rb:4:in `<main>'

Upvotes: 1

Views: 1053

Answers (2)

knut

Reputation: 27855

Nokogiri always stores strings as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings.
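To illustrate this class of error in general, here is a minimal sketch that needs only core Ruby. The variable names and the sample strings are made up for the demonstration; it is not taken from the script in the question. It joins a UTF-8 string with a CP850-tagged string, both containing a non-ASCII character, and then shows that transcoding first avoids the exception.

```ruby
# A UTF-8 string with a non-ASCII character, like the text Nokogiri returns.
utf8_text = "caf\u00e9"

# The byte 0x82 is "é" in CP850; tag the raw byte as CP850 (hypothetical sample data).
cp850_text = "\x82".b.force_encoding(Encoding::CP850)

begin
  utf8_text + cp850_text
  error_raised = false
rescue Encoding::CompatibilityError => e
  # Both strings contain non-ASCII characters in different encodings,
  # so Ruby refuses to concatenate them.
  error_raised = true
  puts "Error: #{e.message}"
end

# Transcoding the CP850 string to UTF-8 first makes the concatenation safe.
combined = utf8_text + cp850_text.encode(Encoding::UTF_8)
puts combined
```

String interpolation follows the same compatibility rules, so any CP850 string mixed into a "#{...}" template together with non-ASCII UTF-8 text can trigger the same exception.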

You have a conflict between UTF-8 and CP850 (are you working on Windows?). You could adapt your File.read(input_filename) call.

Try

File.read(input_filename, :encoding => 'cp850:utf-8') 

That applies if your HTML files are Windows files.

If your HTML files are already UTF-8, then try:

File.read(input_filename, :encoding => 'utf-8') 

Another solution may be to set Encoding.default_external = 'utf-8' at the beginning of your code. (I wouldn't recommend it; use it only for small scripts.)
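Here is a small, self-contained sketch of what the 'cp850:utf-8' option does, assuming a file that really contains CP850 bytes. The temporary file and its contents are invented for the demonstration; in your script the path would be input_filename.

```ruby
require 'tempfile'

# Write a hypothetical sample file containing "café" as CP850 bytes
# (0x82 is "é" in CP850).
sample = Tempfile.new(['sample', '.html'])
sample.binmode
sample.write("caf\x82".b)
sample.close

# 'external:internal' tells Ruby to interpret the bytes on disk as CP850
# and hand back a string transcoded to UTF-8.
text = File.read(sample.path, :encoding => 'cp850:utf-8')

puts text.encoding
puts text.valid_encoding?

sample.unlink
```

The string that comes back is already UTF-8, so it can be mixed with the UTF-8 text Nokogiri returns. If the files are in some other encoding (for example Windows-1252), the external part of the option would need to change accordingly.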

Upvotes: 1

ScottJShea

Reputation: 7111

You can go about it a different way using the CSS selector:

doc.css('a').map { |link| link['href'] }

This would search the doc for all anchors and return their href text in an array.

Upvotes: 4
