amatsuo_net

Reputation: 2448

403 error in web-scraping a specific website with Python

I'm trying to open the following UK Parliament website from my Colab environment, but every request fails with a 403 error. The site seems to be very strict about request headers. Following several answers to similar earlier questions, I've tried much more extensive sets of headers, but it still does not work.

Is there any way?

from urllib.request import urlopen, Request

url = "https://members.parliament.uk/members/commons"

headers = {'User-Agent': 'Mozilla/5.0'}

request = Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()

The longer set of headers I tried is this:

headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
  'Accept-Encoding': 'none',
  'Accept-Language': 'en-US,en;q=0.8',
  'Connection': 'keep-alive'
}

Upvotes: 0

Views: 215

Answers (1)

Md. Fazlul Hoque

Reputation: 16187

The website is behind Cloudflare protection, and Andrew Ryan has already pointed to a possible solution. I also tried cloudscraper, but it didn't work and I still got a 403. I then used Playwright with bs4 (BeautifulSoup), and now it works like a charm.

Example:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    # Launch a visible (non-headless) Chromium window; slow_mo delays each action by 50 ms.
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    # Give the page 5 seconds to finish loading before reading it.
    page.wait_for_timeout(5000)

    # Grab the HTML of the container that holds the member cards.
    loc = page.locator('div[class="card-list card-list-2-col"]')
    html = loc.inner_html()
    browser.close()

# Parse the captured HTML and collect each member's name.
soup = BeautifulSoup(html, "lxml")
for card in soup.select('.card.card-member'):
    data.append({'Name': card.select_one('.primary-info').get_text(strip=True)})

print(data)

Output:

[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
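
The claim above that the 403 comes from Cloudflare, rather than from a missing header, can be checked from the question's original urllib code. The following is only a rough sketch: the helper names `summarize_http_error` and `probe` are hypothetical, and the check relies on the assumption that Cloudflare identifies itself through the `Server` and `CF-RAY` response headers, which it normally does but which is a heuristic, not a guarantee. The demonstration at the bottom uses a hand-built error object with made-up header values so that it runs without network access.

```python
import io
from email.message import Message
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def summarize_http_error(err):
    """Collect the parts of an HTTPError that hint at who rejected the request."""
    server = err.headers.get("Server", "") if err.headers else ""
    ray = err.headers.get("CF-RAY") if err.headers else None
    return {
        "status": err.code,
        "server": server,
        "cf_ray": ray,
        # Heuristic (assumption): Cloudflare usually names itself in the
        # Server header and attaches a CF-RAY request id.
        "looks_like_cloudflare": "cloudflare" in server.lower() or ray is not None,
    }


def probe(url, headers):
    """Request url; return None on success or a summary dict on an HTTP error."""
    try:
        with urlopen(Request(url, headers=headers)) as response:
            response.read()
        return None
    except HTTPError as err:
        return summarize_http_error(err)


# Offline demonstration with a hand-built error; the header values are made up.
fake_headers = Message()
fake_headers["Server"] = "cloudflare"
fake_headers["CF-RAY"] = "hypothetical-ray-id"
fake_error = HTTPError(
    "https://example.invalid/", 403, "Forbidden", fake_headers, io.BytesIO(b"")
)
demo_summary = summarize_http_error(fake_error)
print(demo_summary)
```

Calling `probe(url, headers)` with the `url` and `headers` from the question would return such a summary for the real site. If `looks_like_cloudflare` comes back true, adding more request headers is unlikely to help, and a real browser, as in the Playwright example, is the more promising route.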

Upvotes: 1
