Reputation: 19090
I’m using Rails 4.2.7. I’m currently using the following logic to parse a doc with Nokogiri:
content.xpath("//pre[@class='text-results']").xpath('text()').to_s
In my HTML document, this content appears within my “text-results” block:
<pre class="text-results"><html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta name=Title content="<p><a href=http://mychiptime">
<meta name=Keywords content="">
<meta http-equiv=Content-Type content="text/html; charset=macintosh”>…
I include this section because my parsing dies with the following error:
Error during processing: unknown encoding name - macintosh
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `find'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:627:in `serialize'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:786:in `to_format'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:642:in `to_html'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node.rb:512:in `to_s'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:187:in `block in each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `upto'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:186:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `map'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/nokogiri-1.6.8.1/lib/nokogiri/xml/node_set.rb:218:in `to_s'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_service.rb:8:in `pre_process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:77:in `process_my_object_data'
/Users/davea/Documents/workspace/myproject/app/services/onlinerr_my_object_finder_service.rb:82:in `process_my_object_link'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:29:in `block in process_data'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `each'
/Users/davea/Documents/workspace/myproject/app/services/abstract_my_object_finder_service.rb:28:in `process_data'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:18:in `block in run_all_crawlers'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/.rvm/gems/ruby-2.3.0/gems/activerecord-4.2.7.1/lib/active_record/relation/delegation.rb:46:in `each'
/Users/davea/Documents/workspace/myproject/app/services/run_crawlers_service.rb:5:in `run_all_crawlers'
Is there any way to make Nokogiri ignore this unknown encoding? I’m trying to get the content inside the <pre> tag as text, so I don’t need it parsed further.
I'm on Mac OS X El Capitan. Per the comment, here are my locale settings:
davea$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Upvotes: 3
Views: 1543
Reputation: 2293
See "Nokogiri, open-uri, and Unicode Characters".
When Nokogiri parses a document, it uses the encoding that the document specifies (unless you explicitly tell it what encoding to use).
"macintosh" is not a default Ruby encoding (see Encoding.list for a list of all encodings Ruby knows).
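As a quick check of which names Ruby recognizes, here is a minimal sketch using only the standard Encoding class. The helper name known_encoding? is hypothetical. Encoding.find raises ArgumentError for a name it cannot resolve, which matches the error message in the question.

```ruby
# Hypothetical helper: true when Ruby can resolve the given encoding name.
def known_encoding?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  # Encoding.find raises "unknown encoding name - ..." for unknown names.
  false
end

puts known_encoding?('utf-8')
puts known_encoding?('macRoman')
# The question's backtrace suggests this one is not recognized.
puts known_encoding?('macintosh')
```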
You can force Nokogiri to use an explicit encoding by passing it as an argument to parse.
# encoding is guessed from the document
doc = Nokogiri::HTML.parse(File.open('test.html'))
doc.xpath("//pre[@class='text-results']").xpath('text()').to_s
ArgumentError: unknown encoding name - macintosh
# force Nokogiri to parse the document as 'utf-8'
doc = Nokogiri::HTML.parse(File.open('test.html'), nil, 'utf-8')
doc.xpath("//pre[@class='text-results']").xpath('text()').to_s
=> "\n\n\n"
The caveat is that Nokogiri really will parse the content as 'utf-8', meaning if any special characters are encoded using some other encoding (like macintosh), they may become garbled.
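One way to reduce the garbling risk is to transcode the raw bytes to UTF-8 yourself before handing them to Nokogiri. The sketch below assumes that "macintosh" means Mac OS Roman (it is the IANA label for that charset), which Ruby calls "macRoman". CHARSET_ALIASES and to_utf8 are hypothetical names, and the sample bytes are made up for illustration.

```ruby
# Assumption: the "macintosh" label corresponds to Ruby's "macRoman" encoding.
# CHARSET_ALIASES and to_utf8 are illustrative names, not part of Nokogiri.
CHARSET_ALIASES = { 'macintosh' => 'macRoman' }.freeze

def to_utf8(raw, declared_charset)
  # Translate the declared label to a name Ruby knows, if we have an alias.
  ruby_name = CHARSET_ALIASES.fetch(declared_charset.downcase, declared_charset)
  # Reinterpret the bytes in that encoding, then transcode them to UTF-8.
  raw.dup.force_encoding(ruby_name).encode('UTF-8')
end

bytes = "caf\x8E".b # 0x8E is "é" in Mac OS Roman
utf8 = to_utf8(bytes, 'macintosh')
puts utf8
# The transcoded string could then be parsed with an explicit encoding,
# e.g. Nokogiri::HTML.parse(utf8, nil, 'utf-8').
```

This keeps non-ASCII characters intact, provided the declared charset is actually the one the bytes were written in.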
Upvotes: 0
Reputation: 160551
Your HTML is invalid. You have a <pre> tag outside the <body> and, as a result, Nokogiri has to do fixups, which usually produce questionable results.
This is what Nokogiri has to say about the document:
doc.errors # => [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <html> tag>, #<Nokogiri::XML::SyntaxError: htmlParseStartTag: misplaced <head> tag>, #<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<pre class=\"text-results\">\n\n\n<meta name=\"Title\" content=\"<p><a href=http://mychiptime\">\n<meta name=\"Keywords\" content=\"\">\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”>\n</head>\n\"></pre>"
The line in question, taken on its own, also confuses Nokogiri:
doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh”>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: AttValue: " expected>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag meta>]
doc.to_html # => "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=macintosh”>\">"
Notice that Nokogiri doesn't recognize a closing curly quote as a terminator for the string content="text/html; charset=macintosh”.
You can't fix this within Nokogiri. You'll need to provide the appropriate structure and do a search and replace to convert the curly quotes before parsing the document. Hopefully the document doesn't contain them in the text inside the <body>, or you'll be altering that text, which might be a problem for your use.
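The search and replace could look something like the following sketch, which uses only String#gsub. The names CURLY_QUOTES and straighten_quotes are hypothetical. It replaces every curly quote in the input, including any in body text, so the caveat above still applies.

```ruby
# Map typographic quotes to their straight ASCII equivalents.
CURLY_QUOTES = {
  "\u201C" => '"', # left double quotation mark
  "\u201D" => '"', # right double quotation mark
  "\u2018" => "'", # left single quotation mark
  "\u2019" => "'"  # right single quotation mark
}.freeze

# Illustrative helper: returns a copy of html with curly quotes straightened.
def straighten_quotes(html)
  html.gsub(/[\u201C\u201D\u2018\u2019]/, CURLY_QUOTES)
end

line = '<meta http-equiv=Content-Type content="text/html; charset=macintosh”>'
puts straighten_quotes(line)
```

Run this on the raw markup before passing it to Nokogiri, so that the attribute value is terminated by a straight quote.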
The fact you have curly-quotes in places they shouldn't exist is curious. If your editor is converting from straight quotes to curly quotes then you need to immediately turn off that feature as it'll cause real havoc with coding. Good text editors for coding won't even offer the use of curly quotes because of the problems they cause.
As far as I can tell, Nokogiri isn't complaining about the "macintosh" sequence itself:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse('<meta http-equiv=Content-Type content="text/html; charset=macintosh">')
doc.at('meta')['content'] # => "text/html; charset=macintosh"
If the HTML is clean it doesn't care.
Upvotes: 2