iviivi

Reputation: 43

Why can't I scrape some webpages using Python and bs4?

I've got this code, whose purpose is to download a page's HTML and scrape it using bs4.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl = '' # the target webpage URL goes here

# opening up the connection and downloading the page
uClient = uReq(myUrl) 
pageHtml = uClient.read()
uClient.close()

#html parse
pageSoup = soup(pageHtml, "html.parser")
print(pageSoup)

However, it does not work, here are the errors shown by the terminal:

Traceback (most recent call last):
  File "main.py", line 7, in <module>
    uClient = uReq(myUrl)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Upvotes: 0

Views: 1853

Answers (2)

M. Abreu

Reputation: 366

You are missing some headers that the site may require.

I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:

import requests

url = "https://www.idealista.com/areas/alquiler-viviendas/"

# requests will URL-encode this and append it as ?shape=... for you,
# so the query string does not need to be repeated in the URL itself.
querystring = {"shape": "((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}

headers = {
    'authority': "www.idealista.com",
    'cache-control': "max-age=0",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    'sec-fetch-site': "none",
    'sec-fetch-mode': "navigate",
    'sec-fetch-user': "?1",
    'sec-fetch-dest': "document",
    'accept-language': "en-US,en;q=0.9"
    }

response = requests.get(url, headers=headers, params=querystring)

print(response.text)

From there you can parse the body with bs4 (reusing the `soup` alias from your code):

pageSoup = soup(response.text, "html.parser")
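Once you have the parsed object, extracting elements works the usual bs4 way. A minimal self-contained sketch (using a tiny inline HTML string in place of the downloaded page, and the full `BeautifulSoup` name rather than your `soup` alias):

```python
from bs4 import BeautifulSoup

# A tiny inline document standing in for the downloaded page.
html = '<html><head><title>Listings</title></head><body><a href="/a">A</a><a href="/b">B</a></body></html>'
page_soup = BeautifulSoup(html, "html.parser")

print(page_soup.title.string)                        # Listings
print([a["href"] for a in page_soup.find_all("a")])  # ['/a', '/b']
```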

However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and IP address.
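The user-agent rotation can be as simple as picking a random string per request (a sketch; the strings in the pool are just illustrative values, swap in whatever browsers you want to mimic):

```python
import random

# Illustrative pool of common desktop User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0",
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Then pass it on each call, e.g. `requests.get(url, headers=random_headers())`. Rotating IP addresses requires proxies, which is outside the scope of this snippet.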

Upvotes: 1

Karl

Reputation: 341

An HTTP 403 error means that the web server rejected your script's request because it did not have permission or credentials to access the page.
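You can confirm which status the server is sending by catching `urllib.error.HTTPError` around the `urlopen` call. A standard-library-only sketch (the browser-like User-Agent is an assumption, it sometimes gets past simple bot filters but won't help against an IP ban):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def fetch(url):
    """Try to fetch a page; report the HTTP status if the server refuses."""
    # Sending a browser-like User-Agent sometimes avoids simple bot filters.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(req) as resp:
            return resp.read()
    except HTTPError as e:
        # e.code is the status the server returned, e.g. 403
        print(f"Server refused the request: HTTP {e.code} {e.reason}")
        return None
```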

I can access the page in your example from here, so most likely the web server noticed that you were trying to scrape it and banned your IP address from requesting any more pages. Web servers often do this to stop scrapers from degrading their performance.

The website explicitly forbids what you are trying to do in its legal statement: https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en

So I would suggest contacting the site owner to ask about API access (though this probably won't be free).

Upvotes: 0
