Andrew Latham

Reputation: 6132

Change UTF-8 spaces to RegEx-able spaces

I have the following code that parses an HTML document with Nokogiri:

td.next_element.text.scan(/\A[^(]+/).first.gsub(/\s+/, " ").strip

There is also a case statement whose regular expression uses \s+, and it never matches anything. I tried strip as well, but it had no effect.

After testing with the gsub line above, I figured there was a problem with the way the whitespace was encoded. td.next_element.text[-2].ord returned 160, not the 32 I had expected. I realized that my document was in UTF-8, not ASCII, and that 160 is a non-breaking space (U+00A0).
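The check described above can be reproduced in isolation. This is only a minimal sketch: the sample string is a made-up stand-in for td.next_element.text, and inspect_char is a hypothetical helper, not part of Nokogiri or of the original code.

```ruby
# encoding: UTF-8

# Hypothetical helper: report how Ruby sees a single character.
def inspect_char(ch)
  {
    code_point: ch.ord,           # Unicode code point of the character
    bytes:      ch.bytes,         # its UTF-8 byte sequence
    ascii_ws:   !!(ch =~ /\s/)    # does \s match it?
  }
end

sample = "Price\u00A0"            # assumed input ending in a non-breaking space
p inspect_char(sample[-1])
p sample.strip == sample          # true when strip removed nothing
```

For U+00A0 the byte sequence is 0xC2 0xA0, which is where the two bytes in the attempted /\xA0|\xC2/ pattern come from.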

I should just be able to do this, I thought:

case td.text.strip.downcase.gsub(/\xA0|\xC2/, ' ')

Problem is, I get

Encoding::CompatibilityError 
  (incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)):

What do I do? Also, aren't regular expressions supposed to match all whitespace, not just ASCII?

Upvotes: 1

Views: 1538

Answers (1)

steenslag

Reputation: 80065

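Add the magic comment #encoding: UTF-8 as the first line of your script so that Ruby reads the source file as UTF-8 (from Ruby 2.0 on this is already the default). Then use /[[:space:]]/ instead of \s: in Ruby, \s matches only ASCII whitespace, while [[:space:]] also matches Unicode whitespace such as the non-breaking space (U+00A0).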
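A minimal sketch of how this could be applied to the code from the question. normalize_spaces is a hypothetical helper name, and the raw string is an assumed stand-in for the text Nokogiri returns.

```ruby
# encoding: UTF-8

# Hypothetical helper (not from the question): collapse every run of
# Unicode whitespace, including U+00A0, into one ASCII space, then trim.
def normalize_spaces(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

raw = "Foo\u00A0\u00A0Bar\u00A0"   # assumed stand-in for td.next_element.text
puts normalize_spaces(raw).inspect
```

Because the result contains only ordinary spaces, a later case statement that matches on \s+ should then behave as expected. If only the non-breaking space needs replacing, gsub("\u00A0", " ") is a narrower alternative.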

Upvotes: 4
