403 error in web-scraping a specific website with Python

Question

I'm trying to open the following UK Parliament website from my Colab environment, but every request comes back with a 403 error. The header restriction seems to be too strict. Following several answers to similar earlier questions, I've tried much more extensive sets of headers, but it still does not work.

Is there any way?

from urllib.request import urlopen, Request

url = "https://members.parliament.uk/members/commons"

# Minimal header set: only a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

request = Request(url=url, headers=headers)
response = urlopen(request)  # raises HTTPError: HTTP Error 403
data = response.read()

The longer set of headers I tried is this:

headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive'
}

Md. Fazlul Hoque · Accepted Answer

The website is behind Cloudflare protection, as Andrew Ryan has already pointed out along with a possible solution. I also tried cloudscraper, but it didn't work and I was still getting a 403. I then used Playwright with bs4, and now it works like a charm.
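One rough way to check that the 403 comes from Cloudflare rather than from header validation is to look at the response headers of the failed request. The following is only a minimal sketch, assuming the block page carries the `Server: cloudflare` or `CF-RAY` headers that Cloudflare usually sets; the helper names `looks_like_cloudflare_block` and `probe` are made up for illustration.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def looks_like_cloudflare_block(status, headers):
    """Heuristic: True if a 403/503 response carries Cloudflare's usual headers."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    served_by_cloudflare = lowered.get("server", "").lower() == "cloudflare"
    has_ray_id = "cf-ray" in lowered
    return status in (403, 503) and (served_by_cloudflare or has_ray_id)


def probe(url, user_agent="Mozilla/5.0"):
    """Request url once and return (status, blocked_by_cloudflare)."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request) as response:
            return response.status, looks_like_cloudflare_block(
                response.status, response.headers
            )
    except HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx, but the error object
        # still exposes the status code and the response headers.
        return exc.code, looks_like_cloudflare_block(exc.code, exc.headers)
```

If `probe("https://members.parliament.uk/members/commons")` were to return `(403, True)`, that would point to the Cloudflare challenge rather than to a missing header, which would explain why longer header sets alone do not help.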

Example:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    # Headed Chromium (needs a display); slow_mo delays each action by 50 ms.
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    # Wait 5 seconds so the Cloudflare check and page scripts can finish.
    page.wait_for_timeout(5000)

    # Take the HTML of the container that holds the member cards.
    loc = page.locator('div[class="card-list card-list-2-col"]')
    html = loc.inner_html()

    # Parse the cards with BeautifulSoup (the "lxml" parser needs the lxml package).
    soup = BeautifulSoup(html, "lxml")
    for card in soup.select('.card.card-member'):
        d = {
            'Name': card.select_one('.primary-info').get_text(strip=True)
        }
        data.append(d)

    browser.close()

print(data)

Output:

[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
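The parsing step can also be tried without launching a browser, for instance on HTML saved from an earlier run. Below is a minimal sketch under a few assumptions: the sample markup is hypothetical and only reuses the class names from the selectors above, the helper name `extract_names` is made up, and the built-in `html.parser` is swapped in for `lxml` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the class names come from the selectors above.
SAMPLE_HTML = """
<div class="card-list card-list-2-col">
  <a class="card card-member"><div class="primary-info"> Ms Diane Abbott </div></a>
  <a class="card card-member"><div class="primary-info">Debbie Abrahams</div></a>
  <a class="card card-member"><div class="secondary-info">No name here</div></a>
</div>
"""


def extract_names(html):
    """Return one {'Name': ...} dict per member card that has a .primary-info element."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for card in soup.select(".card.card-member"):
        info = card.select_one(".primary-info")
        # Skip cards without a name instead of failing on None.
        if info is not None:
            names.append({"Name": info.get_text(strip=True)})
    return names


print(extract_names(SAMPLE_HTML))
```

The `None` check is a small defensive addition: in the original loop a card without `.primary-info` would raise an `AttributeError`.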
