Reputation: 758
I'm trying to parse some web pages such as these:
http://www.imovirtual.com/imoveis/apartamentos/t0-t1-entrecampos-mobilado-lisboa/1038329/
http://www.imovirtual.com/imoveis/apartamentos/t2-quinta-do-romao-quarteira/1156717/
I'm using Nokogiri::HTML, and with the first link all is OK, but with the second I only get trash and it's impossible to parse.
I tried using curl, and the result is the same.
Here is a sample of the result for the second link:
��� DG;v�u�G{f�
��;?�@ː0t�Yw���`~�d��
f9����:�}P2k�㤷ϓ���togg���B�D�j���P�AS���cV���5h+�dp
What can be the problem? Both pages render nicely in a browser, and I can't find significant differences in their DOM.
Note: using wget on the second link also results in an unreadable file.
Upvotes: 1
Views: 149
Reputation: 136
The webpage is compressed; check the response header: Content-Encoding: gzip. You need to decompress the body before parsing it.
Edit:
If you are using Ruby, try this:
require 'zlib'
require 'stringio'

clean_html = Zlib::GzipReader.new(StringIO.new(html_compressed)).read
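A minimal, self-contained sketch of that decompression step, assuming you already have the gzip-compressed response body as a string. (Here the sample HTML and the variable names are illustrative; the compression step only simulates what the server does when it sends Content-Encoding: gzip.)

```ruby
require "zlib"
require "stringio"

# Simulate a gzip-compressed HTTP body, as a server sends it
# when the response carries "Content-Encoding: gzip".
html = "<html><body><p>hello</p></body></html>"
buffer = StringIO.new
gz = Zlib::GzipWriter.new(buffer)
gz.write(html)
gz.close
compressed = buffer.string

# Decompress the body exactly as in the snippet above;
# the result is the plain HTML that Nokogiri can parse.
clean_html = Zlib::GzipReader.new(StringIO.new(compressed)).read
puts clean_html
```

Once decompressed, clean_html is ordinary markup and can be passed straight to Nokogiri::HTML.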
Upvotes: 2