Reputation: 41
Hello, here is my script:
require 'nokogiri'

ARGV.each do |input_filename|
  doc = Nokogiri::HTML(File.read(input_filename))
  # Collapse runs of whitespace in the title (a Regexp literal, not a String)
  title = doc.title.gsub(/\s+/, " ").downcase.strip
  body = doc.xpath('//body').inner_text.tr('"', '').gsub("\n", '').downcase.strip
  link = doc.search("a[@href]") # Adding this part generates errors
  filename = File.basename(input_filename, ".*")
  puts %Q("#{title}", "#{body}", "#{filename}", "#{link}").downcase
end
I am having trouble extracting links from a list of HTML files. I believe the issue is due to unconventional coding in some of the HTML files. Here is the error I am getting:
extractor.rb:9:in `block in <main>': incompatible character encodings: UTF-8 and CP850 (Encoding::CompatibilityError)
from extractor.rb:4:in `each'
from extractor.rb:4:in `<main>'
Upvotes: 1
Views: 1053
Reputation: 27855
Nokogiri always stores strings as UTF-8 internally, and methods that return text values always return UTF-8 encoded strings.
You have a conflict between UTF-8 and CP850 (are you working on Windows?).
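To illustrate the kind of conflict behind that error message, here is a minimal sketch using only core Ruby. The helper name `joinable?` and the sample strings are made up for illustration; the sketch assumes one string is UTF-8 and the other is tagged as CP850, and that both contain a non-ASCII character.

```ruby
# Hypothetical helper: true if Ruby can combine the two strings
# without raising Encoding::CompatibilityError.
def joinable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

utf8_text  = "caf\u00E9"                         # UTF-8 string with a non-ASCII character
cp850_text = "caf\x82".b.force_encoding('CP850') # same word as CP850 bytes (0x82 is "é")
ascii_only = "cafe".encode('CP850')              # CP850, but only ASCII characters

puts joinable?(utf8_text, ascii_only)  # ASCII-only strings mix without trouble
puts joinable?(utf8_text, cp850_text)  # both non-ASCII, different encodings

begin
  "#{utf8_text}, #{cp850_text}"        # interpolation has to join the two strings
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```

This may also explain why such a script works on some files and fails on others: the error only shows up once both sides contain non-ASCII characters.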
You can adapt your File.read(input_filename) call.
If your HTML files are Windows files (CP850-encoded), try
File.read(input_filename, :encoding => 'cp850:utf-8')
which reads the file as CP850 and transcodes it to UTF-8.
If your HTML files are already UTF-8, then try:
File.read(input_filename, :encoding => 'utf-8')
Another solution may be to set Encoding.default_external = 'utf-8'
at the beginning of your code. (I wouldn't recommend it; use it only for small scripts.)
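As a rough sketch of what the 'cp850:utf-8' option does, the snippet below writes a few CP850 bytes to a temporary file and reads them back as UTF-8. The helper `read_as_utf8` and the sample bytes are hypothetical; your real files may use a different source encoding, in which case the first half of the option would need to change.

```ruby
require 'tempfile'

# Hypothetical helper: read a file stored in `source_encoding`
# and return its content transcoded to UTF-8.
def read_as_utf8(path, source_encoding = 'cp850')
  File.read(path, :encoding => "#{source_encoding}:utf-8")
end

Tempfile.create('sample') do |file|
  file.binmode
  file.write("caf\x82".b)  # "café" as CP850 bytes (0x82 is "é" in CP850)
  file.flush

  text = read_as_utf8(file.path)
  puts text.encoding         # encoding tag of the returned string
  puts text.valid_encoding?  # whether the bytes are valid for that tag
end
```

The result can then be handed to Nokogiri::HTML and mixed freely with the UTF-8 strings Nokogiri returns.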
Upvotes: 1
Reputation: 7111
You can go about it a different way using a CSS selector:
doc.css('a').map { |link| link['href'] }
This searches the doc for all anchors and returns their href values in an array. Anchors without an href attribute show up as nil in that array.
Upvotes: 4