Change UTF-8 spaces to RegEx-able spaces

Question

I have the following code that parses an HTML document with Nokogiri:

td.next_element.text.scan(/\A[^(]+/).first.gsub(/\s+/, " ").strip

There is also a case statement whose regular expression contains \s+, and it never matches anything. I tried strip as well, but it had no effect.

After testing with the gsub line above, I figured there was a problem with the way the whitespace was encoded. td.next_element.text[-2].ord returned 160, not the 32 I had expected. I realized that my document was in UTF-8 and not ASCII, and that 160 is the code point of a non-breaking space (U+00A0).

I should just be able to do this, I thought:

case td.text.strip.downcase.gsub(/\xA0|\xC2/, ' ')

Problem is, I get

Encoding::CompatibilityError 
  (incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)):

What do I do? Also, aren't regular expressions supposed to match all whitespace, not just ASCII?
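
For illustration, here is a minimal sketch of the behaviour described above and one possible way around it. It assumes Ruby 1.9 or later and a UTF-8 string. The names NBSP, normalize_spaces and sample are hypothetical, made up for this example. In Ruby, \s covers only ASCII whitespace, while the POSIX bracket class [[:space:]] is Unicode-aware and also matches U+00A0.

```ruby
# Hypothetical constant: the non-breaking space, written as a Unicode escape
# so that the string literal is UTF-8 regardless of the source encoding.
NBSP = "\u00A0"

# Hypothetical helper: collapse every run of Unicode whitespace (including
# non-breaking spaces) into a single ASCII space, then trim the ends.
def normalize_spaces(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Made-up sample text standing in for td.next_element.text.
sample = "Foo#{NBSP}#{NBSP}Bar#{NBSP}"

p NBSP.ord                  # code point of the non-breaking space
p(/\s/ =~ NBSP)             # nil: \s does not match U+00A0
p(/[[:space:]]/ =~ NBSP)    # 0: the POSIX class does match it
p normalize_spaces(sample)  # "Foo Bar"
```

Matching the character as /\u00A0/ rather than as the raw bytes \xA0 and \xC2 keeps the regular expression compatible with UTF-8 strings, which is likely what the Encoding::CompatibilityError is complaining about.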

