Codium

Reputation: 3240

`gsub': incompatible character encodings: UTF-8 and IBM437

I searched here and on Google, but with no luck.

OS: Windows XP, Ruby version: 1.9.3p0

Error:

`gsub': incompatible character encodings: UTF-8 and IBM437

Code:

require 'rubygems'
require 'hpricot'
require 'net/http'

source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')


doc = Hpricot(source) 

doc.search("p.MsoNormal/a").each do |a|
  puts a.to_plain_text
end

The program outputs a few strings, but when the text is "NOŻYCE" I get the error above. Could somebody help?

Upvotes: 2

Views: 5272

Answers (2)

gioele

Reputation: 10205

Ruby has tagged the string in source as UTF-8, but the bytes it holds are not actually UTF-8.

As tadman wrote, you must first tell Ruby that the actual characters in the string are in the IBM437 encoding. Then you can convert that string to your favourite encoding, but only if such a conversion is possible.

source.force_encoding('IBM437').encode('UTF-8')
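To make the difference between the two calls concrete, here is a minimal sketch. The sample bytes are an assumption for illustration (the word "café" as IBM437 stores it), not data from the page in the question.

```ruby
# Hypothetical sample: the bytes of "café" as stored in IBM437 (0x82 is "é").
raw = [0x63, 0x61, 0x66, 0x82].pack('C*')   # pack returns a binary (ASCII-8BIT) string

# force_encoding only relabels the string; the bytes are untouched.
labelled = raw.dup.force_encoding('IBM437')

# encode transcodes: same characters, new byte sequence.
utf8 = labelled.encode('UTF-8')

puts labelled.bytes.inspect   # [99, 97, 102, 130]
puts utf8.bytes.inspect       # [99, 97, 102, 195, 169]
puts utf8.encoding            # UTF-8
```

The single byte 0x82 becomes the two-byte UTF-8 sequence for "é", which is the actual conversion; force_encoding on its own never changes a byte.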

In your case, you cannot convert your string to ISO-8859-2 because not all IBM437 characters can be converted to that charset. Sticking to UTF-8 is probably your best option.
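A small sketch of what "cannot convert" looks like in practice. The chosen byte (0xE0, the Greek letter alpha in IBM437) is just an illustrative example of a character that ISO-8859-2 does not contain.

```ruby
# 0xE0 is the Greek letter alpha in IBM437; ISO-8859-2 has no such character.
alpha = [0xE0].pack('C*').force_encoding('IBM437')

failed = false
begin
  alpha.encode('ISO-8859-2')
rescue Encoding::UndefinedConversionError => e
  failed = true
  puts "cannot convert: #{e.message}"
end

# UTF-8 can represent the character, so this conversion succeeds.
as_utf8 = alpha.encode('UTF-8')

# If ISO-8859-2 is unavoidable, unmappable characters can be replaced instead.
lossy = alpha.encode('ISO-8859-2', undef: :replace, replace: '?')
```

The replace option loses information, so it is only worth considering when the target charset is fixed by something outside your control.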

Anyway, are you sure that the file is actually transmitted in IBM437? Maybe it is stored as such on the HTTP server but sent over the wire in another encoding. Or it may not be IBM437 at all; it may be CP852, also called MS-DOS Latin 2 (different from ISO Latin 2).
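One way to check is to look at the charset the server declares and to decode the same bytes under each candidate encoding. The sketch below assumes a few things: charset_from is a hypothetical helper, not part of Net::HTTP, and the sample bytes are what "NOŻYCE" would look like if the page really were CP852.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
def charset_from(content_type)
  match = content_type.to_s.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

puts charset_from('text/html; charset=windows-1250').inspect
puts charset_from('text/html').inspect

# Assumed sample bytes: "NOŻYCE" as CP852 stores it (0xBD is "Ż" there).
raw = [0x4E, 0x4F, 0xBD, 0x59, 0x43, 0x45].pack('C*')

# Decode the same bytes under each candidate and compare the results by eye.
decoded = {}
%w[IBM437 CP852].each do |name|
  decoded[name] = raw.dup.force_encoding(name).encode('UTF-8')
  puts "#{name}: #{decoded[name]}"
end
```

With a real request, the header value would come from the response object, for example Net::HTTP.get_response('host', '/page.asp')['content-type']. If the header names no charset, the candidate whose output reads as real Polish text is the likely one.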

Upvotes: 3

tadman

Reputation: 211560

You could try converting your HTML to UTF-8 since it appears the original is in vintage-retro DOS format:

source.encode!('UTF-8')

That should flip it from 8-bit ASCII to UTF-8 as expected by the Hpricot parser.
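One caveat, sketched under the assumption that the string is already tagged as UTF-8 while holding DOS-encoded bytes: in that case a call with only the target encoding has nothing to convert. String#encode! also accepts the source encoding as a second argument. The sample bytes are hypothetical.

```ruby
# Hypothetical sample: "café" in IBM437 bytes, mislabelled as UTF-8.
source = [0x63, 0x61, 0x66, 0x82].pack('C*').force_encoding('UTF-8')

# Target encoding only: UTF-8 to UTF-8 leaves the bytes as they are.
source.encode!('UTF-8')
still_broken = !source.valid_encoding?

# Naming the source encoding makes Ruby transcode the raw bytes.
source.encode!('UTF-8', 'IBM437')
fixed = source.valid_encoding?

puts still_broken
puts fixed
```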

Upvotes: 4
