Reputation: 43
I've got this code whose purpose is to fetch a page's HTML and scrape it using bs4:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = ''  # the target webpage goes here
# opening up the connection and downloading the page
uClient = uReq(myUrl)
pageHtml = uClient.read()
uClient.close()
#html parse
pageSoup = soup(pageHtml, "html.parser")
print(pageSoup)
However, it does not work. Here is the error shown in the terminal:
Traceback (most recent call last):
File "main.py", line 7, in <module>
uClient = uReq(myUrl)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
response = self.parent.error(
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Upvotes: 0
Views: 1853
Reputation: 366
You are missing some headers that the site may require. I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:
import requests

# Base URL only; the shape parameter is passed via params so it
# isn't duplicated in the query string.
url = "https://www.idealista.com/areas/alquiler-viviendas/"
querystring = {"shape": "((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}
headers = {
    'authority': "www.idealista.com",
    'cache-control': "max-age=0",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    'sec-fetch-site': "none",
    'sec-fetch-mode': "navigate",
    'sec-fetch-user': "?1",
    'sec-fetch-dest': "document",
    'accept-language': "en-US,en;q=0.9"
}
# A GET request should not carry a body, so no data argument is needed.
response = requests.get(url, headers=headers, params=querystring)
print(response.text)
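If you'd rather stay with urllib, just adding a User-Agent header is sometimes enough to get past a 403. A minimal sketch, reusing myUrl from your script (the header value is illustrative):

from urllib.request import Request, urlopen

# Wrap the URL in a Request object so headers can be attached.
req = Request(myUrl, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
with urlopen(req) as uClient:
    pageHtml = uClient.read()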
Either way, you can then parse the body using bs4 (reusing the soup alias from your imports):
pageSoup = soup(response.text, "html.parser")
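For instance, to pull every link out of the parsed page (a generic illustration, not specific to idealista's markup):

# Which tags/attributes to target depends entirely on the page structure.
for a in pageSoup.find_all("a", href=True):
    print(a["href"])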
However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and IP address.
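A minimal sketch of rotating the user-agent, reusing the headers dict above (the proxy URL is a placeholder you would replace with your own pool):

import random

# Illustrative pool; in practice you'd maintain a larger, current list.
user_agents = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
]
headers["user-agent"] = random.choice(user_agents)

# Placeholder proxy; requests routes the call through it when supplied.
proxies = {"https": "http://user:pass@proxy.example.com:8080"}
response = requests.get(url, headers=headers, params=querystring, proxies=proxies)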
Upvotes: 1
Reputation: 341
The HTTP 403 error you received means that the web server rejected your script's request because it did not have the permission/credentials to access the page.
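You can see this from your script by catching the exception and inspecting the response; a small sketch around your existing uReq call:

from urllib.error import HTTPError

try:
    uClient = uReq(myUrl)
except HTTPError as e:
    # e.code is the status (403 here); the body sometimes explains the block.
    print(e.code, e.reason)
    print(e.read()[:500])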
I can access the page in your example from here, so most likely the web server noticed that you were trying to scrape it and banned your IP address from requesting any more pages. Web servers often do this to keep scrapers from affecting their performance.
The site explicitly forbids what you are trying to do in its terms here: https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en
So I would suggest contacting the site owner to ask for API access (this probably won't be free, though).
Upvotes: 0