JunkLatte

Reputation: 41

Python Web Scrape Request resulting in a 406 Error

I am trying to scrape https://registry.verra.org/app/search/VCS/All%20Projects for a school project. I am trying to send a request to the "download excel" button by replicating the POST request going on in the background.

Here's what I have so far.

import requests
import datetime as dt

url_back = 'https://registry.verra.org/uiapi/resource/resource/search?$skip=0&count=true&$format=excel&$exportFileName=allprojects.xlsx'
data = {"program":"VCS",
        "resourceStatuses":["VCS_EX_CRD_PRD_VER_REQUESTED","VCS_EX_CRD_PRD_REQUESTED",
                            "VCS_EX_REGISTERED","VCS_EX_REG_VER_APPR_REQUESTED",
                            "VCS_EX_REGISTRATION_REQUESTED","VCS_EX_REJ",
                            "VCS_EX_UNDER_DEVELOPMENT_CLD","VCS_EX_UNDER_DEVELOPMENT_OPN",
                            "VCS_EX_UNDER_VALIDATION_CLD","VCS_EX_UNDER_VALIDATION_OPN",
                            "VCS_EX_CRED_TRANS_FRM_OTHER_PROG","VCS_EX_WITHDRAWN"]}
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Content-Length": "369",
    "Content-Type": "application/json",
    "Cookie": "fpestid=9g1E7EZczSniadmveW8TL8DIBB_w-MDFov_fr0DQqgBD46kgkoVSzIdQHKP-hSxMbBr4tg; _ga=GA1.2.1884498504.1652482731; _gid=GA1.2.1741997157.1652482731; ASPSESSIONIDQERRTRAR=BFIILIADNEINGJAKKMCJGKKO",
    "Host": "registry.verra.org",
    "Origin": "https://registry.verra.org",
    "Referer": "https://registry.verra.org/app/search/VCS/All%20Projects",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "sec-ch-ua-mobile": "?1",
    "sec-ch-ua-platform": "Android"
    }

response = requests.post(url_back, data=data, headers=headers)
print(response)

with open('dwnld.xlsx', 'wb') as f:
    f.write(response.content)

However, the response is a 406 error every time, even though I am sending "*/*" in the Accept header and a valid User-Agent that shouldn't be blocked. Any ideas as to why the POST does not return a real response?

Upvotes: 1

Views: 292

Answers (3)

John Gordon

Reputation: 33335

headers = {
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
   ...

You've told the website that you will only accept responses that use these specific encodings, and these specific languages.

But the website can't deliver those. So it returns 406, telling you that it can't meet your requirements.
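Whatever triggers the 406, the script in the question writes the error body to dwnld.xlsx anyway. A minimal sketch of checking the status first, assuming the requests library is installed; save_download is a hypothetical helper, and the hand-built Response (including the private _content attribute) is only a stand-in so the example runs without contacting the site:

```python
import requests


def save_download(response, path):
    """Write the response body to path only if the server reported success."""
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    with open(path, "wb") as f:
        f.write(response.content)


# Stand-in for a real 406 reply (assumption: built by hand, not from the site).
fake = requests.Response()
fake.status_code = 406
fake._content = b"Not Acceptable"

try:
    save_download(fake, "dwnld.xlsx")
    saved = True
except requests.HTTPError as exc:
    saved = False
    print("download failed:", exc)
```

With a real response object from requests.post(...), the same helper would leave no file behind when the server refuses the request.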

Upvotes: 1

Andrej Kesely

Reputation: 195438

Try the json= parameter instead of data=. The headers= argument isn't necessary:

import requests

url = "https://registry.verra.org/uiapi/resource/resource/search?%24skip=0&count=true&%24format=excel&%24exportFileName=allprojects.xlsx"

payload = {
    "program": "VCS",
    "resourceStatuses": [
        "VCS_EX_CRD_PRD_VER_REQUESTED",
        "VCS_EX_CRD_PRD_REQUESTED",
        "VCS_EX_REGISTERED",
        "VCS_EX_REG_VER_APPR_REQUESTED",
        "VCS_EX_REGISTRATION_REQUESTED",
        "VCS_EX_REJ",
        "VCS_EX_UNDER_DEVELOPMENT_CLD",
        "VCS_EX_UNDER_DEVELOPMENT_OPN",
        "VCS_EX_UNDER_VALIDATION_CLD",
        "VCS_EX_UNDER_VALIDATION_OPN",
        "VCS_EX_CRED_TRANS_FRM_OTHER_PROG",
        "VCS_EX_WITHDRAWN",
    ],
}

with open("dwnld.xlsx", "wb") as f_out:
    f_out.write(requests.post(url, json=payload).content)

This saves the spreadsheet to dwnld.xlsx, which opens in LibreOffice.
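To see what json= changes, the request can be prepared locally and inspected without sending it. This is only a sketch, assuming requests is installed; the shortened payload, the bare endpoint URL and the variable names are for illustration:

```python
import json
import requests

# Shortened payload and bare endpoint, for illustration only.
url = "https://registry.verra.org/uiapi/resource/resource/search"
payload = {"program": "VCS", "resourceStatuses": ["VCS_EX_REGISTERED", "VCS_EX_REJ"]}

# prepare() builds the request locally; nothing is sent over the network.
form_req = requests.Request("POST", url, data=payload).prepare()
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_req.body)                     # key=value pairs joined with "&"
print(json_req.headers["Content-Type"])  # application/json
print(json.loads(json_req.body))         # round-trips to the original dict
```

With json=, requests serializes the dict and sets the Content-Type header itself, which is why no headers need to be passed by hand.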

Upvotes: 1

Md. Fazlul Hoque

Reputation: 16187

The Content-Type header says the request body is JSON, so the body has to be sent as JSON too. Pass the dictionary with json=data instead of data=data:

import requests

url_back = 'https://registry.verra.org/uiapi/resource/resource/search?$skip=0&count=true&$format=excel&$exportFileName=allprojects.xlsx'
data = {"program":"VCS",
        "resourceStatuses":["VCS_EX_CRD_PRD_VER_REQUESTED","VCS_EX_CRD_PRD_REQUESTED",
                            "VCS_EX_REGISTERED","VCS_EX_REG_VER_APPR_REQUESTED",
                            "VCS_EX_REGISTRATION_REQUESTED","VCS_EX_REJ",
                            "VCS_EX_UNDER_DEVELOPMENT_CLD","VCS_EX_UNDER_DEVELOPMENT_OPN",
                            "VCS_EX_UNDER_VALIDATION_CLD","VCS_EX_UNDER_VALIDATION_OPN",
                            "VCS_EX_CRED_TRANS_FRM_OTHER_PROG","VCS_EX_WITHDRAWN"]}
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Content-Length": "369",
    "Content-Type": "application/json",
    "Cookie": "fpestid=9g1E7EZczSniadmveW8TL8DIBB_w-MDFov_fr0DQqgBD46kgkoVSzIdQHKP-hSxMbBr4tg; _ga=GA1.2.1884498504.1652482731; _gid=GA1.2.1741997157.1652482731; ASPSESSIONIDQERRTRAR=BFIILIADNEINGJAKKMCJGKKO",
    "Host": "registry.verra.org",
    "Origin": "https://registry.verra.org",
    "Referer": "https://registry.verra.org/app/search/VCS/All%20Projects",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "sec-ch-ua-mobile": "?1",
    "sec-ch-ua-platform": "Android"
    }

response = requests.post(url_back, json=data, headers=headers)
print(response)

# with open('dwnld.xlsx', 'wb') as f:
#     f.write(response.content)
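The mismatch in the original script can be reproduced offline. A minimal sketch, assuming requests is installed, with a shortened payload and a single header chosen for illustration: when a Content-Type header is passed explicitly, data= still form-encodes the dict, so the header and the body disagree.

```python
import requests

# Shortened payload and bare endpoint, for illustration only.
url = "https://registry.verra.org/uiapi/resource/resource/search"
payload = {"program": "VCS", "resourceStatuses": ["VCS_EX_REGISTERED"]}
headers = {"Content-Type": "application/json"}

# data= keeps the caller's header but form-encodes the body.
mismatched = requests.Request("POST", url, data=payload, headers=headers).prepare()
print(mismatched.headers["Content-Type"])  # application/json
print(mismatched.body)  # program=VCS&resourceStatuses=VCS_EX_REGISTERED

# json= serializes the same dict as JSON, so header and body agree.
matched = requests.Request("POST", url, json=payload, headers=headers).prepare()
print(matched.body)
```

Nothing is sent here; prepare() only builds the request, so the snippet is safe to run without touching the registry.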

Upvotes: 1
