ASX

Reputation: 1725

ruby 1.9 character conversion errors while testing regex

I know there are tons of docs and debates out there, but still:

This is my best attempt, in Rails, at testing scraped data from various websites. The strange thing is that if I manually copy-paste the source of a URL, everything works fine.

What can I do?

# encoding: utf-8

require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://www.website.com/url/test'

sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s

# 1) String manipulation test
p doc.search('h1')[0].text        # "Nove36  "
p doc.search('h1')[0].text.strip! # nil <- ERROR

# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"

p /#{regex}/i =~ txtdoc # nil, but an integer match position was expected

I realize that my OS (Ubuntu) plus my text editor are probably doing some helpful encoding conversion over what is probably a broken encoding. That's fine, but how can I fix this problem in my app while it's running live?

Upvotes: 0

Views: 433

Answers (2)

matt

Reputation: 79743

The problems you're having are caused by non-breaking space characters (Unicode U+00A0) in the page.

In your first problem, the string:

"Nove36  "

actually ends with U+00A0, and String#strip doesn't consider this character whitespace, so there is nothing for it to remove:

1.9.3-p125 :001 > s = "Foo \u00a0"
 => "Foo  " 
1.9.3-p125 :002 > s.strip
 => "Foo  "    #unchanged

In your second problem, the space between the price and the euro sign is again a non-breaking space, so the regex simply doesn't match, as it is looking for a normal space:

# s as before
1.9.3-p125 :003 > s =~ /Foo  / #2 spaces, no match
 => nil 
1.9.3-p125 :004 > s =~ /Foo /  #1 space, match
 => 0 
1.9.3-p125 :005 > s =~ /Foo \u00a0/  #space and non breaking space, match
 => 0

When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you only copy normal space characters, which is why it works that way.

The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:

sio = open(url)
@cur_encoding = sio.charset

txt = sio.read             #read the whole file
txt.gsub! "\u00a0", " "    #global replace

doc = Nokogiri::HTML(txt, nil, @cur_encoding)   #use this new string instead...
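
As a variation, assuming your strings are UTF-8: Ruby's POSIX bracket class [[:space:]] matches Unicode whitespace, including U+00A0, unlike \s, which is ASCII-only. You can use it to make both the stripping and the price regex tolerant of non-breaking spaces directly:

"Nove36 \u00a0".gsub(/[[:space:]]+\z/, "")          # => "Nove36"
"test 44.00\u00a0€ test" =~ /[0-9.]+[[:space:]]€/   # => 5, matches despite the nbsp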

Upvotes: 2

bobince

Reputation: 536399

@cur_encoding = doc.encoding # ISO-8859-15

ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
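
A minimal sketch of that compounding, with an illustrative string: re-encoding UTF-8 bytes as if they were ISO-8859-15 turns each byte of a multibyte character into a separate character:

price = "44.00 €"   # valid UTF-8; the "€" is the three bytes E2 82 AC
mangled = price.dup.force_encoding("ISO-8859-15").encode("UTF-8")
p mangled           # => "44.00 â\u0082¬", mojibake that a regex looking for € can no longer match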

This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.

However, Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby, looking at the source, the problem would seem to be that it uses the encoding property of the string-or-IO it is given, whereas open-uri exposes the header value as charset.

You can pass in an override encoding of your own, so I guess try:

sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
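
Assuming the server really does send that charset=utf-8 header, a quick sanity check that the override took effect:

# sio and doc as above
p sio.charset   # => "utf-8", taken from the Content-Type response header
p doc.encoding  # should now report the override rather than the faulty <meta> value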

Upvotes: 3
