ASX

Reputation: 1725

ruby 1.9 character conversion errors while testing regex

I know there are tons of docs and debates out there, but still:

This is my best attempt, in Rails, at testing scraped data from various websites. The strange thing is that if I manually copy-paste the source of a URL, everything works fine.

What can I do?

# encoding: utf-8

require 'rubygems'
require 'iconv'
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://www.website.com/url/test'

sio = open(url)
@cur_encoding = sio.charset
doc = Nokogiri::HTML(sio, nil, @cur_encoding)
txtdoc = doc.to_s

# 1) String manipulation test
p doc.search('h1')[0].text        # "Nove36  "
p doc.search('h1')[0].text.strip! # nil <- ERROR

# 2) Regex test
# txtdoc = "test test 44.00 € test test" # <- THIS WORKS
regex = "[0-9.]+ €"

p /#{regex}/i =~ txtdoc # nil, but an integer match position was expected

I realize that my OS (Ubuntu) plus my text editor are probably doing some helpful encoding conversion over what is probably a broken encoding. That's fine, but how can I fix this problem in my app while it's running live?

Upvotes: 0

Views: 433

Answers (2)

matt

Reputation: 79743

The problems you're having are caused by non-breaking space characters (Unicode U+00A0) in the page.

In your first problem, the string:

"Nove36  "

actually ends with U+00A0, and String#strip doesn't consider this character whitespace, so there is nothing for it to remove:

1.9.3-p125 :001 > s = "Foo \u00a0"
 => "Foo  " 
1.9.3-p125 :002 > s.strip
 => "Foo  "    #unchanged

In your second problem, the space between the price and the euro sign is again a non-breaking space, so the regex simply doesn't match, as it is looking for a normal space:

# s as before
1.9.3-p125 :003 > s =~ /Foo  / #2 spaces, no match
 => nil 
1.9.3-p125 :004 > s =~ /Foo /  #1 space, match
 => 0 
1.9.3-p125 :005 > s =~ /Foo \u00a0/  #space and non breaking space, match
 => 0

When you copy and paste the source, the browser probably normalises the non-breaking spaces, so you only copy normal space characters, which is why it works that way.

The simplest fix would be to do a global substitution of \u00a0 for space before you start processing:

sio = open(url)
@cur_encoding = sio.charset

txt = sio.read             #read the whole file
txt.gsub! "\u00a0", " "    #global replace

doc = Nokogiri::HTML(txt, nil, @cur_encoding)   #use this new string instead...
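
As a variation, assuming your strings are UTF-8: Ruby's POSIX bracket class [[:space:]] matches Unicode whitespace, including U+00A0, unlike \s, which is ASCII-only. You can use it to make both the stripping and the price regex tolerant of non-breaking spaces directly:

"Nove36 \u00a0".gsub(/[[:space:]]+\z/, "")          # => "Nove36"
"test 44.00\u00a0€ test" =~ /[0-9.]+[[:space:]]€/   # => 5, matches despite the nbsp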

Upvotes: 2

bobince

Reputation: 536399

@cur_encoding = doc.encoding # ISO-8859-15

ISO-8859-15 is not the correct encoding for the quoted page; it should have been UTF-8. iconving it to UTF-8 as if it were 8859-15 only compounds the problem.
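
A minimal sketch of that compounding, with an illustrative string: re-encoding UTF-8 bytes as if they were ISO-8859-15 turns each byte of a multibyte character into a separate character:

price = "44.00 €"   # valid UTF-8; the "€" is the three bytes E2 82 AC
mangled = price.dup.force_encoding("ISO-8859-15").encode("UTF-8")
p mangled           # => "44.00 â\u0082¬", mojibake that a regex looking for € can no longer match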

This encoding is coming from a faulty <meta> tag in the document. A browser will ignore that tag and use the overriding encoding from the Content-Type: text/html;charset=utf-8 HTTP response header.

However, Nokogiri appears not to be able to read this header from the open()ed stream. With the caveat that I know nothing about Ruby, looking at the source, the problem would seem to be that it uses the encoding property of the string-or-IO it is given, whereas open-uri exposes the header value as charset.

You can pass in an override encoding of your own, so I guess try:

sio = open(url)
doc = Nokogiri::HTML.parse(sio, nil, sio.charset) # should be UTF-8?
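
Assuming the server really does send that charset=utf-8 header, a quick sanity check that the override took effect:

# sio and doc as above
p sio.charset   # => "utf-8", taken from the Content-Type response header
p doc.encoding  # should now report the override rather than the faulty <meta> value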

Upvotes: 3
