Python: BeautifulSoup returning garbage

Question

I am building a basic data crawler in Python using BeautifulSoup for Batoto, the manga host. For some reason, fetching the URL works sometimes and fails at other times. For example:

from bs4 import BeautifulSoup
from urllib2 import urlopen

x = urlopen(*manga url here*)
y = BeautifulSoup(x)

print y

The result should be the parsed markup of the page, but instead I get a big wall of this:

´ºŸ{›æP™oRhtüs2å÷%ëmßñ6Y›þ�GDŸ0ËÂÍ‡ì¼®Yé)–ÀØÅð&ô]½f³ÓÞ€Þþ)ú$÷á�üv…úzW¿¾úà†lªÀí¥ï«·_    OTL_ˆêsÁÿƒÁÖ<Ø?°Þ›Â+WLç¥àEh>rýÜ>x    ˆ‡eÇžù»èå»–Ùý e:›§`L_.‹¦úoÓ‘®e=‰ìÓ4Wëo’]~Ãõ¬À8>x:²âœ2¸ Á|&0ÍVpMLÎñ»v¥Ín÷-ÅÃ‰–T§`Ì.SÔsóë„œ¡×[˜·P6»�ùè�>Ô¾È]Œ—·ú£âÊgí%Ø¶kwýÃ=ÜÏ¸2cïÑfÙ_�×]Õê“ž?„UÖ* m³/`ñ§ÿL0³dµ·jªÅ}õ/õOXß×;«]®’Ï¯w‹·þ¡ÿ|Gýª`I{µœ}œí�ë–¼yÖÇ'�Wç�ëµÅþþ*ýœd{ÿDv:Ð íHzqÿÆ÷æélG-èÈâpÇßQé´^ÐO´®Xÿ�ýö(‹šëñþ"4!SÃõ2{òÿÜ´»ûE

wrapped in html and body tags.

If I keep retrying, it sometimes works, but the behavior is so inconsistent that I can't figure out the cause.

Any help would be appreciated.

Padraic Cunningham · Accepted Answer

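The problem seems to be in urlopen rather than BeautifulSoup: the bytes it returns look like a compressed (most likely gzip) response body, which urllib2 does not decompress for you. requests handles content encoding transparently and works fine: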

import requests
from bs4 import BeautifulSoup

x = requests.get("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
y = BeautifulSoup(x.content)
print y

Rakudai Kishi no Eiyuutan - Scanlations - Comic - Comic Directory - Batoto - Batoto
.................

Using urlopen we get the following:

from urllib2 import urlopen

x = urlopen("http://bato.to/comic/_/comics/rakudai-kishi-no-eiyuutan-r11615")
print x.read()


���������s+I���2���l��9C<�� ^�����쾯�dw�xzNT%��,T��A^�ݫ���9��a��E�C���W!�����ڡϳ��f7���s2�Px$���}I�*�'��;'3O>���'g?�u®{����e.�ڇ�e{�u���jf:aث
�����DS��%��X�Zͮ���������9�:�Dx�����\-�
�*tBW������t�I���GQ�=�c��\:����u���S�V(�>�œ��gǿ�o�OE3jçCV<`���Q!��5�B��N��Ynd����?~��q���� _G����;T�S'�@΀��t��Ha�.;J�61'`Й�@���>>`��Z�ˠ�x�@� J*u��'���-����]p�9{>����������#�<-~�K"[AQh0HjP
0^��R�]�{N@��
 ...................

So, as you can see, the problem is with urlopen, not BeautifulSoup.
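
If you would rather stay with urlopen, one plausible explanation (an assumption on my part, not something the output above proves) is that the server sometimes sends the body gzip- or deflate-compressed and urlopen hands back the raw compressed bytes. Under that assumption, a minimal sketch of a workaround is to decompress the body yourself before giving it to BeautifulSoup. The helper name decode_body is hypothetical, and it only covers gzip and deflate:

```python
import gzip
import io
import zlib


def decode_body(raw, content_encoding=None):
    """Return the decompressed body of an HTTP response.

    raw is the byte string from response.read(); content_encoding is the
    value of the Content-Encoding header, if any. Hypothetical helper that
    assumes the server only ever uses gzip or deflate.
    """
    encoding = (content_encoding or "").strip().lower()

    # gzip streams start with the magic bytes 0x1f 0x8b, so also check for
    # those in case the header is missing.
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()

    if encoding == "deflate":
        try:
            # Standard zlib-wrapped deflate stream.
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)

    # Not compressed (or an encoding this sketch does not handle).
    return raw
```

With urlopen this would be used roughly as html = decode_body(x.read(), x.info().get("Content-Encoding")), followed by BeautifulSoup(html). If the page then parses every time, compression was the cause; if not, the assumption is wrong and requests remains the simpler fix.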
