QJ123123

Reputation: 93

Python web scraping: HTTP returns 403 Forbidden status code

I'm trying to scrape this site and I get a 403 status code. It's the first time I've run into this while web scraping, and I don't really understand what I have to do to solve it. I think I could use Selenium to scrape the page, but I wonder if it's possible to fetch the AJAX response directly and get the JSON back. If that's not possible, could I get an explanation of why? Thanks.

Here is my code:

import requests
url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}

r = requests.get(url, headers=headers)
print(r.status_code)

Code generated by Insomnia from the copied cURL command:

import requests

url = "https://public-api.pricempire.com/api/item/loadGraph/14/875"

headers = {
    "authority": "public-api.pricempire.com",
    "pragma": "no-cache",
    "cache-control": "no-cache",
    # Note: this value was mangled by Windows cURL quoting when the
    # request was copied; the original sec-ch-ua string is lost here.
    "sec-ch-ua": "^\^",
}

response = requests.get(url, headers=headers)

print(response.text)

The first two times I ran it, it gave me status 200, but afterwards it gives me 403. I'm trying to figure out why and I just don't know.

Upvotes: 1

Views: 389

Answers (2)

KBill

Reputation: 85

Sometimes none of the usual techniques work. As a last resort, you can fetch the page's content from the Google Cache.

import requests

# The headers
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0'}

# The URL you want to scrape
url_2_scrap = 'https://www.my_url.com'

# Full URL to get the cached content
url_full = 'https://webcache.googleusercontent.com/search?q=cache:' + url_2_scrap

# Response of the request
response = requests.get(url_full, headers=headers)

# If the status is good,
if response.status_code == 200:
    print("OK! It works fine! ;-)")
# If it's not good,
else:
    print("It doesn't work :-(")

Upvotes: 0

kosciej16

Reputation: 7158

This page doesn't look public, so some sort of authentication is needed first. In that case you need to see what authentication mechanism is used and try to reproduce it with the requests library.

So open the web inspector in your browser, go to the Network tab, right-click the request to the page and copy it as cURL. You will probably see a bearer token in the headers (or maybe a cookie with a session_id); append it to your program's headers/cookies, as in the sketch below, and it should work.
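
A minimal sketch of what that looks like. The Authorization and cookie values here are placeholders; copy the real ones from your own browser session (only one of the two may apply to this site):

import requests

url = 'https://public-api.pricempire.com/api/item/loadGraph/14/1140'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    # Placeholder -- replace with the token copied from your browser
    'Authorization': 'Bearer <token-from-browser>',
}
cookies = {
    # Placeholder -- replace with the session cookie from your browser
    'session_id': '<value-from-browser>',
}

r = requests.get(url, headers=headers, cookies=cookies)
print(r.status_code)
if r.ok:
    print(r.json())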

Upvotes: 2
