DoesntEven

Reputation: 73

403 Forbidden Error when scraping a site, user-agents already used and updated. Any ideas?

As the title states, I am getting a 403 error. The URLs I generate are valid; I can print them and then open them in my browser just fine.

I've set a user agent, and it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using a session instead of a straight request, I've tried using urllib, and I've tried a plain requests.get.

Here's the code I'm using that 403s. Same result with requests.get etc.

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}

session = requests.Session()
req = session.get(URL, headers=headers)  # URL is built elsewhere; this still returns 403
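
The urllib variant I tried looked roughly like this (a minimal sketch; same headers and URL variable as above, same 403):

import urllib.request

req = urllib.request.Request(URL, headers=headers)
with urllib.request.urlopen(req) as resp:  # raises urllib.error.HTTPError: 403
    body = resp.read()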

So yeah, I assume I'm not setting the user agent right, so the site can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.

Upvotes: 1

Views: 4927

Answers (2)

Le Khiem

Reputation: 886

Add more headers, not only User-Agent. For example, in a Scrapy spider's start_requests:

from scrapy import Request

# Inside a scrapy.Spider subclass:
def start_requests(self):
    # All of these were copied from a real browser request in DevTools
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Cookie': 'AMCV_0D15148954E6C5100A4C98BC%40AdobeOrg=1176715910%7CMCIDTS%7C19271%7CMCMID%7C80534695734291136713728777212980602826%7CMCAAMLH-1665548058%7C7%7CMCAAMB-1665548058%7C6G1ynYcLPuiQxYZrsz_pkqfLG9yMXBpb2zX5dvJdYQJzPXImdj0y%7CMCOPTOUT-1664950458s%7CNONE%7CMCAID%7CNONE%7CMCSYNCSOP%7C411-19272%7CvVersion%7C5.4.0; s_ecid=MCMID%7C80534695734291136713728777212980602826; __cfruid=37ff2049fc4dcffaab8d008026b166001c67dd49-1664418998; AMCVS_0D15148954E6C5100A4C98BC%40AdobeOrg=1; s_cc=true; __cf_bm=NIDFoL5PTkinis50ohQiCs4q7U4SZJ8oTaTW4kHT0SE-1664943258-0-AVwtneMLLP997IAVfltTqK949EmY349o8RJT7pYSp/oF9lChUSNLohrDRIHsiEB5TwTZ9QL7e9nAH+2vmXzhTtE=; PHPSESSID=ddf49facfda7bcb4656eea122199ea0d',
        'If-Modified-Since': 'Tue, 04 May 2021 05:09:49 GMT',
        'If-None-Match': 'W/"12c6a-5c17a16600f6c-gzip"',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'TE': 'trailers',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'
    }
    for url in self.start_urls:
        yield Request(url, headers=headers)
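
If you are not using Scrapy, the same idea works with plain requests; a minimal sketch, assuming headers is the dict above and URL is the page from the question:

import requests

r = requests.get(URL, headers=headers)  # same browser-copied headers
print(r.status_code)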

Upvotes: 1

furas

Reputation: 142651

I got all the headers from DevTools and started removing them one by one. It turns out the site needs only Accept-Language; it doesn't need User-Agent, and it doesn't need a Session.

import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'

# The only header this endpoint actually requires; without it the server returns 403
headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}

r = requests.get(url, headers=headers)

data = r.json()

print(data['docs'][0]['name'])

Result:

The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
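
That trial-and-error can also be scripted if you want to repeat it on another site; a minimal sketch (my own helper, with full_headers standing in for the complete set copied from DevTools):

import requests

def minimal_headers(url, full_headers):
    # Try dropping each header in turn; keep it only if the request fails without it
    needed = dict(full_headers)
    for name in list(full_headers):
        trial = {k: v for k, v in needed.items() if k != name}
        if requests.get(url, headers=trial).status_code == 200:
            needed.pop(name)  # the site still answers without this header
    return needed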

Upvotes: 3
