iviivi

Reputation: 43

Why can't I scrape some webpages using Python and bs4?

I've got this code, whose purpose is to download a page's HTML and scrape it using bs4.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myUrl = '' # the target webpage URL goes here

# opening up the connection and downloading the page
uClient = uReq(myUrl) 
pageHtml = uClient.read()
uClient.close()

#html parse
pageSoup = soup(pageHtml, "html.parser")
print(pageSoup)

However, it does not work, here are the errors shown by the terminal:

Traceback (most recent call last):
  File "main.py", line 7, in <module>
    uClient = uReq(myUrl)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\ProgramData\Anaconda3\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Upvotes: 0

Views: 1853

Answers (2)

M. Abreu

Reputation: 366

You are missing some headers that the site may require.

I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:

import requests

url = "https://www.idealista.com/areas/alquiler-viviendas/"

# requests will URL-encode this and append it as ?shape=... for you,
# so the query string does not need to be repeated in the URL itself.
querystring = {"shape": "((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}

headers = {
    'authority': "www.idealista.com",
    'cache-control': "max-age=0",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    'sec-fetch-site': "none",
    'sec-fetch-mode': "navigate",
    'sec-fetch-user': "?1",
    'sec-fetch-dest': "document",
    'accept-language': "en-US,en;q=0.9"
    }

response = requests.get(url, headers=headers, params=querystring)

print(response.text)

From there you can parse the body with bs4 (reusing the `soup` alias from your code):

pageSoup = soup(response.text, "html.parser")
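Once you have the parsed object, extracting elements works the usual bs4 way. A minimal self-contained sketch (using a tiny inline HTML string in place of the downloaded page, and the full `BeautifulSoup` name rather than your `soup` alias):

```python
from bs4 import BeautifulSoup

# A tiny inline document standing in for the downloaded page.
html = '<html><head><title>Listings</title></head><body><a href="/a">A</a><a href="/b">B</a></body></html>'
page_soup = BeautifulSoup(html, "html.parser")

print(page_soup.title.string)                        # Listings
print([a["href"] for a in page_soup.find_all("a")])  # ['/a', '/b']
```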

However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and IP address.
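The user-agent rotation can be as simple as picking a random string per request (a sketch; the strings in the pool are just illustrative values, swap in whatever browsers you want to mimic):

```python
import random

# Illustrative pool of common desktop User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0",
]

def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Then pass it on each call, e.g. `requests.get(url, headers=random_headers())`. Rotating IP addresses requires proxies, which is outside the scope of this snippet.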

Upvotes: 1

Karl

Reputation: 341

An HTTP 403 error means that the web server rejected your script's request because it did not have permission or credentials to access the page.
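You can confirm which status the server is sending by catching `urllib.error.HTTPError` around the `urlopen` call. A standard-library-only sketch (the browser-like User-Agent is an assumption, it sometimes gets past simple bot filters but won't help against an IP ban):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def fetch(url):
    """Try to fetch a page; report the HTTP status if the server refuses."""
    # Sending a browser-like User-Agent sometimes avoids simple bot filters.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(req) as resp:
            return resp.read()
    except HTTPError as e:
        # e.code is the status the server returned, e.g. 403
        print(f"Server refused the request: HTTP {e.code} {e.reason}")
        return None
```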

I can access the page in your example from here, so most likely the web server noticed that you were trying to scrape it and banned your IP address from requesting any more pages. Web servers often do this to stop scrapers from degrading their performance.

The website explicitly forbids what you are trying to do in its legal statement: https://www.idealista.com/ayuda/articulos/legal-statement/?lang=en

So I would suggest contacting the site owner to ask about API access (though this probably won't be free).

Upvotes: 0
