Rainer Bärs

Reputation: 61

Scraping website 503 error and output problem

I am a beginner, trying to scrape a website in a Jupyter Notebook for the first time using these tools. Most of my code is based on examples; I can't claim I have a deep understanding...

I'm trying to make an HTTP request in Python + Beautiful Soup to read data from a website so I can compile it. At the moment I am again getting 503 errors, although I have defined a user-agent and am trying to handle cookies. I had it working without errors at one point, but then I couldn't decode the output. Now I am back to the 503 error. I can't remember what I changed; I have been tinkering with this for a while now, and Jupyter is not very good at keeping a change log once you have quit the notebook.

Am I doing something seriously wrong, or is the site just good at protecting itself from scraping?

The data is for personal use; I enjoy compiling stuff like this and then building mathematical models and/or deep learning to predict new data. Just for personal fun, and helpful if I try to sell my car...

The code:

from urllib.request import urlopen as uOpen
from urllib.request import Request as uReq
from urllib.request import build_opener as uOpener
import urllib.parse

import requests
import pprint
import json

from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar

from bs4 import BeautifulSoup as soup

my_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }

my_url = 'https://www.nettiauto.com/toyota/prius?id_country[]=73&page=1'

try:
    req=uReq(my_url, None, my_headers)
    cj = CookieJar()
    opener = uOpener(urllib.request.HTTPCookieProcessor(cj))
    uClient = opener.open(req)
    data = uClient.read()
except urllib.request.HTTPError as inst:
    output = format(inst)
    print(output)


uClient.close()

encoding = uClient.info().get_content_charset()
page_html = data.decode(encoding, errors='ignore')

page_soup = soup(page_html, "html.parser")

Output before being put into soup:

'\x1f\x08\x00\x00\x00\x00\x00\x00\x03}r۶\x7f?\x05ʶt"QnlK\x1d֤\x1328\x10\tIH%Aj\x7f?Og\x0bMY\x00o(YrlϴĖH\\\x16\x16\x16\x16\x16\rG<yӧh"<\x17yc\x7f4\x1fOΟ\x7f>??yL@\x01f!\x153O_iH\x08\x1f뗗eS~K\n2r&r5uG8`Z\rth8ө\x04V\x1b\x1dOW~A\x07fSjPlFTC\'h`\x07\x01\x19ѫ\x07q;|\x1e[(\x11\x11":k9\x13\'\x1aշ&ȕCȞ $Y4TOa#}!\x1dP\x1fF\x08Aq$x\x18b\x7f30\\/0aN90z\x19Goȡ$$.A\x03\x17XG瑈B\x11A\x13N\t\x0c\t\x1c\\p&0<.0r;ןPĈQ\x12Ҹ\x0b\x0e\'v\x13fe\x00\x01\x1fr\x11`\x1fqU\x1crR\x0bِ\\imlOJ\x1f_GXyiA\x06\x07\x0b\x1bft7ǍF\x1e[GG\x1aǝΓGݕƜ]R\x0b \x19\txi^3\x7f?\x0fu~zѓ/.&fj\\Oyk\x0fO\x021tE=<Γ\nPSxP+2\x19>u:\x12\x01\x03H^-І^]#^\x13Ym{\x19q}o\x15@\x1dkLp`O\x07Աl\x1e1\x11߽7\x1fxLfa\x01SB3\x1bf2*.eS\x14\x10\x13\x1e\x08;\x12ڀI@F\x1bPֽ%\x11a1rYΰ53hK\\C\x12zMa쪦Xƌ3jcwq\x04MZ® \x01Âh#\x0e}3wB; !\x1cTU:br:-ÿ,wm^qRn3\x1b\x1b&\tmF\x10 i`3e*\x0c\x1d\t*\\2Ha\ngN\x04>\x04\x1eUy\x04\x1e#\x0f3<&AV\x129[fi=G\x11<\r\\|Yuaխʇw>w\x0fG4\x16\x1e\n\x1c\x1e#\t\x16\\DS+U2#L\x1cr\x17acp\x034;PK<Dh~ǯGawwt; XH9T\x1d~/9XxN\x0f\x07n=(\x0e/t\x1cΙ\x17AD\x0e/0{%5\x1d\n\r\x02c$YQ@~!}踇#\x01a\x15wNYH\x02x@\x17Qpc|I/\x0e#ZR(+Us0UK?NSrx\x14n\x1f\njO\tvH%]-\x12ӈ3\x11c\x12r;\x0c\x7f\x18akg\x07Mè6\rV[\x01ic\x18V\r\x03j0YK\t!"Gһ,A<ł\x19L\x17\x18v#n^x51i\x12n\x07Cm^/hR"Wrٻu\'bȯ\x1a-r^\n0!\x10\x17\x02iro\x11\t浈3S7\x1b{t\x0fPU`\x1e*t3\x17;/C\x1d{c1KjGv7bl˂,O8#^fBڦC;\t~js``gG ̆qe6/\r+ 2+<ƕn9Z\n\x1d\x06\x148r\x0ckkRZ(:NS\x00~B:vv\x19`\'\x07tyR._\x1bahVֆѫ\x0f}3ݎ\x1ar\x04P\x1b\x03\x0b6g!aG}T2K?\x04!\\|\x04\x0b\\P/\x16V\x19vaA_y$aj@F\x01\t\'OF#̘nݐ\x1cz\x13bO\x1f*\x11*Þ"\x0f)Gc\x161\x16(\x14\x01e\x00,aO˥1\x11"=\x7f\x0cC-UQ\x86/U\x14FM°#\x084#\x10e|Jba:B1\x01N8\x08<IBޙ"\n]\x13~JUT\x12˱C\x1c\x1c$r6|\x08Rz_AQͬ\x10D<K:(^\x14ȵ\x18R\x15^\x15(\x05Jrarw:dMj6Q\nB\x10\x13Pv\x1b\x104.Y\n0)\x1a*BI!嬸eYO\\k2n{\x17\x0e8>{\x0b߱}ܻlL#\'eRA\x1f>}۫ב$/$zlGo^ EG{J*\x15ay\n\x15^P\x10F\x02M.\x12\x1cf\x18A#\x1e\x11d\x0b\x01\\@q\x13rVkU\x04\x1c1\x0f\x15wz\x04(c\x0e!)r5\x073r)IN_@\x13`WU\x0e\r\x16o\x10\x165ؑ\x0e̎\x7fuRFj\x13\x02A\x0e}8\x17\x0f\x07cg$\x7f\x19д\x0c\x0el\x17}\r\x03\x1ev]\x0bd$<0\x174sΩGP]ןOP\x0fVєS\x0c\x05Rn.\x10]\x7f\n\x05\x0eaIq\'\x11r*iH\x08I\x14\tadE\x0ciR\x1b\x04\nP`}uD\x15,{8,\x0b\x17MMϐ_=vyH\r\x7fmn\x06\x1f#ώ3\x1dy1٘sR\x05at\x05.-"ڦ]hvKnhٞU)yuO6)w\x0bv[-\x03F9\\\x7f
<and so on>

Output after soup:

}r۶?ʶt"QnlK
֤28   IH%Aj?Og
MYo(YrlϴĖH\
G<yӧh"<yc4oο>??yL@f!3O_i뗗eS~K
2r&amp;r5uG8`Z
th8өV
OW~AfSjPlFTC'h`ѫq;|
[(":k9'շ&amp;ȕCȞ $Y4TOa#}!
PAq$xb30\/0aN90zGoȡ$$.AXG瑈BAN   
    
\p&amp;0&lt;.0r;ןPĈQҸ
'vfer`qU
rR
ِ\imlOJ_GXyiA
ft7ǍF
[GGǝΓGݕƜ]R
   xi^3?u~zѓ/.&amp;fj\OykO1tE=&lt;Γ
PSxP+2&gt;u:H^-І^]#^Ym{q}o@
kLp`OԱl
1߽7xLfaSB3f2*.eS
;ڀI@FPֽ%a1rYΰ53hK\CzMa쪦Xƌ3jcwqMZ® Âh#}3wB; !
TU:br:-ÿ,wm^qRn3&amp; mF i`3e*

I'm getting somewhat frustrated and confused here. Would anyone have any insight into what's wrong?

I appreciate all the help, thanks.

Upvotes: 0

Views: 3322

Answers (1)

JarDan

Reputation: 11

I am attempting a similar exercise to yours on nettiauto.com, and found out that the website uses Cloudflare's technology to prevent unwanted bot traffic. As a result, when using 'requests' to pull any of the specific pages on their domain, I got the same 503 error.

The solution to this 503 error is to use the cloudscraper library available for Python (https://pypi.org/project/cloudscraper/). You can find more information about how it works on that page.
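As a rough sketch (untested against the live site, since Cloudflare's checks change over time), usage with the URL from your question would look roughly like this:

import cloudscraper
from bs4 import BeautifulSoup

# create_scraper() returns a requests.Session-like object that tries to
# pass Cloudflare's anti-bot challenge before fetching the page
scraper = cloudscraper.create_scraper()

my_url = 'https://www.nettiauto.com/toyota/prius?id_country[]=73&page=1'
response = scraper.get(my_url)
response.raise_for_status()  # raises an exception if the site still returns 503

# the body is decompressed and decoded by the underlying requests machinery,
# so it can go straight into BeautifulSoup
page_soup = BeautifulSoup(response.text, "html.parser")
print(page_soup.title)

Because cloudscraper builds on requests, response.text is already decompressed and decoded for you, so you don't need the manual charset handling from your urllib version.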

Upvotes: 1
