Nuno Silva

Reputation: 758

What are these strange issues when scraping a web page, maybe encoding?

I'm trying to parse some web pages such as these:

http://www.imovirtual.com/imoveis/apartamentos/t0-t1-entrecampos-mobilado-lisboa/1038329/
http://www.imovirtual.com/imoveis/apartamentos/t2-quinta-do-romao-quarteira/1156717/

I'm using Nokogiri::HTML, and with the first link all is OK, but with the second I only get garbage that's impossible to parse.

I tried using curl, and the result is the same. Here is a sample of the result for the second link:

��� DG;v�u�G{f�
                     ��;?�@ː0t�Yw���`~�d��
f9����:�}P2k�㤷ϓ���togg���B�D�j���P�AS���cV���5h+�dp

What could be the problem? Both pages render nicely in a browser, and I can't find significant differences in their DOM.

Note: using wget on the second link results in an unreadable file.

Upvotes: 1

Views: 149

Answers (1)

Nelson Brandão

Reputation: 136

The webpage is compressed; check the response header: Content-Encoding: gzip. You need to decompress the body before parsing it.

Edit:

If you are using Ruby, try this:

require 'zlib'
require 'stringio'

clean_html = Zlib::GzipReader.new(StringIO.new(html_compressed)).read
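A self-contained round-trip sketch of the same idea (the sample HTML string and variable names here are illustrative; the compressed string stands in for a server body sent with Content-Encoding: gzip):

```ruby
require 'zlib'
require 'stringio'

# Simulate a gzip-compressed response body, as a server would send it
# when answering with "Content-Encoding: gzip".
html = '<html><body>ola</body></html>'
buffer = StringIO.new
gz = Zlib::GzipWriter.new(buffer)
gz.write(html)
gz.close
html_compressed = buffer.string

# Decompress it back to readable HTML, as in the answer above.
clean_html = Zlib::GzipReader.new(StringIO.new(html_compressed)).read
puts clean_html  # prints <html><body>ola</body></html>
```

In a real fetch you would apply the GzipReader step only when the response's Content-Encoding header is "gzip"; otherwise the body is already plain text.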

Upvotes: 2

Related Questions