Reputation: 6132
I have the following code that parses an HTML document with Nokogiri:
td.next_element.text.scan(/\A[^(]+/).first.gsub(/\s+/, " ").strip
There is also a case statement with a regular expression that has \s+ and isn't catching anything. I tried to use strip, but it did not do anything.
After testing with the gsub line above, I figured there was a problem with the way whitespace was encoded. td.next_element.text[-2].ord returned not 32 as I had expected, but 160. I realized that my document was in UTF-8 and not ASCII, and that 160 was a non-breaking space.
I should just be able to do this, I thought:
case td.text.strip.downcase.gsub(/\xA0|\xC2/, ' ')
The problem is, I get Encoding::CompatibilityError (incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string)).
What do I do? Also, aren't regular expressions supposed to match all whitespace, not just ASCII?
Upvotes: 1
Views: 1538
Reputation: 80065
Add the comment #encoding: UTF-8 as the first line of your script; use /[[:space:]]/ to find Unicode whitespace.
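As a rough sketch of the suggestion above: the sample string and the normalize_space helper are made up for illustration and are not from the question's document. It assumes the text is a UTF-8 string containing U+00A0, which is what an ord of 160 suggests.

```ruby
# encoding: UTF-8
# (On Ruby 2.0 and later, UTF-8 is already the default source encoding.)

# Hypothetical sample standing in for td.next_element.text:
# the words are separated by non-breaking spaces (U+00A0).
raw = "Foo\u00A0Bar\u00A0(baz)"

# Hypothetical helper: collapse any run of Unicode whitespace into a single
# ASCII space, then strip. [[:space:]] is Unicode-aware, unlike \s.
def normalize_space(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Same shape as the expression in the question: take everything before "(".
cleaned = normalize_space(raw[/\A[^(]+/])

puts cleaned.inspect
puts ("\u00A0" =~ /\s/).inspect          # \s only covers ASCII whitespace
puts ("\u00A0" =~ /[[:space:]]/).inspect
```

If you would rather target only the non-breaking space, writing it as /\u00A0/ in the regexp keeps the pattern in UTF-8 instead of raw bytes, which should avoid the ASCII-8BIT mismatch.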
Upvotes: 4