Rainer Bärs

Reputation: 61

Scraping website 503 error and output problem

I am a beginner, trying to scrape a website in a Jupyter Notebook for the first time using these tools. Most of my code is based on examples; I can't claim I have a deep understanding...

I'm trying to make an HTTP request in Python + Beautiful Soup to read data from a website so I can compile it. At the moment I am again getting 503 errors, although I have defined a user-agent and am trying to handle cookies. I had it working without errors at one point, but then I couldn't decode the output. Now I am back to the 503 error. I can't remember what I changed; I have been tinkering with this for a while now, and Jupyter is not very good at keeping a change log once you have quit the notebook.

Am I doing something seriously wrong, or is the site just good at protecting itself from scraping?

The data is for personal use; I enjoy compiling stuff like this and then building mathematical models and/or deep learning to predict new data. Just for personal fun, and helpful if I try to sell my car...

The code:

from urllib.request import urlopen as uOpen
from urllib.request import Request as uReq
from urllib.request import build_opener as uOpener
import urllib.parse

import requests
import pprint
import json

from requests.exceptions import HTTPError
from socket import error as SocketError
from http.cookiejar import CookieJar

from bs4 import BeautifulSoup as soup

my_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }

my_url = 'https://www.nettiauto.com/toyota/prius?id_country[]=73&page=1'

try:
    req=uReq(my_url, None, my_headers)
    cj = CookieJar()
    opener = uOpener(urllib.request.HTTPCookieProcessor(cj))
    uClient = opener.open(req)
    data = uClient.read()
except urllib.request.HTTPError as inst:
    output = format(inst)
    print(output)


uClient.close()

encoding = uClient.info().get_content_charset()
page_html = data.decode(encoding, errors='ignore')

page_soup = soup(page_html, "html.parser")

Output before being put into soup:

'\x1f\x08\x00\x00\x00\x00\x00\x00\x03}r۶\x7f?\x05ʶt"QnlK\x1d֤\x1328\x10\tIH%Aj\x7f?Og\x0bMY\x00o(YrlϴĖH\\\x16\x16\x16\x16\x16\rG<yӧh"<\x17yc\x7f4\x1fOΟ\x7f>??yL@\x01f!\x153O_iH\x08\x1f뗗eS~K\n2r&r5uG8`Z\rth8ө\x04V\x1b\x1dOW~A\x07fSjPlFTC\'h`\x07\x01\x19ѫ\x07q;|\x1e[(\x11\x11":k9\x13\'\x1aշ&ȕCȞ $Y4TOa#}!\x1dP\x1fF\x08Aq$x\x18b\x7f30\\/0aN90z\x19Goȡ$$.A\x03\x17XG瑈B\x11A\x13N\t\x0c\t\x1c\\p&0<.0r;ןPĈQ\x12Ҹ\x0b\x0e\'v\x13fe\x00\x01\x1fr\x11`\x1fqU\x1crR\x0bِ\\imlOJ\x1f_GXyiA\x06\x07\x0b\x1bft7ǍF\x1e[GG\x1aǝΓGݕƜ]R\x0b \x19\txi^3\x7f?\x0fu~zѓ/.&fj\\Oyk\x0fO\x021tE=<Γ\nPSxP+2\x19>u:\x12\x01\x03H^-І^]#^\x13Ym{\x19q}o\x15@\x1dkLp`O\x07Աl\x1e1\x11߽7\x1fxLfa\x01SB3\x1bf2*.eS\x14\x10\x13\x1e\x08;\x12ڀI@F\x1bPֽ%\x11a1rYΰ53hK\\C\x12zMa쪦Xƌ3jcwq\x04MZ® \x01Âh#\x0e}3wB; !\x1cTU:br:-ÿ,wm^qRn3\x1b\x1b&\tmF\x10 i`3e*\x0c\x1d\t*\\2Ha\ngN\x04>\x04\x1eUy\x04\x1e#\x0f3<&AV\x129[fi=G\x11<\r\\|Yuaխʇw>w\x0fG4\x16\x1e\n\x1c\x1e#\t\x16\\DS+U2#L\x1cr\x17acp\x034;PK<Dh~ǯGawwt; XH9T\x1d~/9XxN\x0f\x07n=(\x0e/t\x1cΙ\x17AD\x0e/0{%5\x1d\n\r\x02c$YQ@~!}踇#\x01a\x15wNYH\x02x@\x17Qpc|I/\x0e#ZR(+Us0UK?NSrx\x14n\x1f\njO\tvH%]-\x12ӈ3\x11c\x12r;\x0c\x7f\x18akg\x07Mè6\rV[\x01ic\x18V\r\x03j0YK\t!"Gһ,A<ł\x19L\x17\x18v#n^x51i\x12n\x07Cm^/hR"Wrٻu\'bȯ\x1a-r^\n0!\x10\x17\x02iro\x11\t浈3S7\x1b{t\x0fPU`\x1e*t3\x17;/C\x1d{c1KjGv7bl˂,O8#^fBڦC;\t~js``gG ̆qe6/\r+ 2+<ƕn9Z\n\x1d\x06\x148r\x0ckkRZ(:NS\x00~B:vv\x19`\'\x07tyR._\x1bahVֆѫ\x0f}3ݎ\x1ar\x04P\x1b\x03\x0b6g!aG}T2K?\x04!\\|\x04\x0b\\P/\x16V\x19vaA_y$aj@F\x01\t\'OF#̘nݐ\x1cz\x13bO\x1f*\x11*Þ"\x0f)Gc\x161\x16(\x14\x01e\x00,aO˥1\x11"=\x7f\x0cC-UQ\x86/U\x14FM°#\x084#\x10e|Jba:B1\x01N8\x08<IBޙ"\n]\x13~JUT\x12˱C\x1c\x1c$r6|\x08Rz_AQͬ\x10D<K:(^\x14ȵ\x18R\x15^\x15(\x05Jrarw:dMj6Q\nB\x10\x13Pv\x1b\x104.Y\n0)\x1a*BI!嬸eYO\\k2n{\x17\x0e8>{\x0b߱}ܻlL#\'eRA\x1f>}۫ב$/$zlGo^ EG{J*\x15ay\n\x15^P\x10F\x02M.\x12\x1cf\x18A#\x1e\x11d\x0b\x01\\@q\x13rVkU\x04\x1c1\x0f\x15wz\x04(c\x0e!)r5\x073r)IN_@\x13`WU\x0e\r\x16o\x10\x165ؑ\x0e̎\x7fuRFj\x13\x02A\x0e}8\x17\x0f\x07cg$\x7f\x19д\x0c\x0el\x17}\r\x03\x1ev]\x0bd$<0\x174sΩGP]ןOP\x0fVєS\x0c\x05Rn.\x10]\x7f\n\x05\x0eaIq\'\x11r*iH\x08I\x14\tadE\x0ciR\x1b\x04\nP`}uD\x15,{8,\x0b\x17MMϐ_=vyH\r\x7fmn\x06\x1f#ώ3\x1dy1٘sR\x05at\x05.-"ڦ]hvKnhٞU)yuO6)w\x0bv[-\x03F9\\\x7f
<and so on>

Output after soup:

}r۶?ʶt"QnlK
֤28   IH%Aj?Og
MYo(YrlϴĖH\
G<yӧh"<yc4oο>??yL@f!3O_i뗗eS~K
2r&amp;r5uG8`Z
th8өV
OW~AfSjPlFTC'h`ѫq;|
[(":k9'շ&amp;ȕCȞ $Y4TOa#}!
PAq$xb30\/0aN90zGoȡ$$.AXG瑈BAN   
    
\p&amp;0&lt;.0r;ןPĈQҸ
'vfer`qU
rR
ِ\imlOJ_GXyiA
ft7ǍF
[GGǝΓGݕƜ]R
   xi^3?u~zѓ/.&amp;fj\OykO1tE=&lt;Γ
PSxP+2&gt;u:H^-І^]#^Ym{q}o@
kLp`OԱl
1߽7xLfaSB3f2*.eS
;ڀI@FPֽ%a1rYΰ53hK\CzMa쪦Xƌ3jcwqMZ® Âh#}3wB; !
TU:br:-ÿ,wm^qRn3&amp; mF i`3e*

I'm getting somewhat frustrated and confused here. Would anyone have any insight into what's wrong?

I appreciate all the help, thanks.

Upvotes: 0

Views: 3322

Answers (1)

JarDan

Reputation: 11

I am attempting a similar exercise to yours on nettiauto.com, and found out that the website uses Cloudflare's technology to prevent unwanted bot traffic. As a result, when using 'requests' to pull any of the specific pages on their domain, I got the same 503 error.

The solution to this 503 error is to use the cloudscraper library available for Python (https://pypi.org/project/cloudscraper/). You can find more information about how it works on that page.
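As a rough sketch (untested against the live site, since Cloudflare's checks change over time), usage with the URL from your question would look roughly like this:

import cloudscraper
from bs4 import BeautifulSoup

# create_scraper() returns a requests.Session-like object that tries to
# pass Cloudflare's anti-bot challenge before fetching the page
scraper = cloudscraper.create_scraper()

my_url = 'https://www.nettiauto.com/toyota/prius?id_country[]=73&page=1'
response = scraper.get(my_url)
response.raise_for_status()  # raises an exception if the site still returns 503

# the body is decompressed and decoded by the underlying requests machinery,
# so it can go straight into BeautifulSoup
page_soup = BeautifulSoup(response.text, "html.parser")
print(page_soup.title)

Because cloudscraper builds on requests, response.text is already decompressed and decoded for you, so you don't need the manual charset handling from your urllib version.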

Upvotes: 1
