Ruby 1.9 regex encoding

Question

I am parsing this feed http://www.sixapart.com/labs/update/developers/ with nokogiri and then running some regexes on the contents of some tags. The content is mostly UTF-8, but is occasionally corrupt. For my case I don't really care and just need to pass the right parts of the content through, so I'm happy to treat the data as binary/ASCII-8BIT. The problem is that the regexes in my script are always treated as either UTF-8 or ASCII, no matter what I set the encoding comment to or how I create the regex.

Is there a solution to this? Can I force the regex to binary? Can I do a gsub without a regex easily? (I am just replacing & with &amp;)

Carlos D · Accepted Answer

You need to force the encoding of the pattern string to binary and pass the Regexp::FIXEDENCODING option when you build the regex.

1.9.3-head :018 > r = Regexp.new("chars".force_encoding("binary"), Regexp::FIXEDENCODING)
=> /chars/
1.9.3-head :019 > r.encoding
=> #<Encoding:ASCII-8BIT>
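
Applied to the replacement from the question, this could look roughly like the sketch below. The sample bytes and the variable names (`raw`, `amp`) are made up for illustration and are not taken from the feed. The last line shows the regex-free variant, which should be enough for a fixed substring such as `&`.

```ruby
# Hypothetical input: mostly UTF-8, with one stray invalid byte (\xFF).
# Treat it as raw bytes so no encoding validation gets in the way.
raw = "Tom & Jerry \xFF caf\xC3\xA9".dup.force_encoding(Encoding::BINARY)

# Option 1: a binary regex, built as in the answer above.
amp = Regexp.new("&".dup.force_encoding(Encoding::BINARY), Regexp::FIXEDENCODING)
with_regex = raw.gsub(amp, "&amp;")

# Option 2: no regex at all. gsub also accepts a plain string pattern,
# and an ASCII-only pattern is compatible with a binary receiver.
without_regex = raw.gsub("&", "&amp;")
```

Both results stay tagged as binary, so the corrupt byte passes through untouched. If the data later has to be written out as UTF-8, it would need a `force_encoding` back at that point.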
