AJAX web scraping using python Requests

Question

I was trying to scrape this website but wasn't getting the table data. I even copied the request data from the Chrome dev tools, but I can't work out what I'm doing wrong.

Here is my script:

import requests,json
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
r = requests.post(url,headers=headers,data=payload)
print(r.content)

Bertrand Martel · Accepted Answer

Your request is missing the IFC-Cache-Header HTTP header, and the API also expects a JWT token passed in the Authorization header.

To retrieve this token, you first need to extract some values from the root page:

GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre

which contains the following JavaScript object:

window.HSBC.dpas = {
    "pageInformation": {
        "country": "X", <========= HERE
        "language": "X", <========= HERE
        "tokenIssue": {
            "url": "/api/v1/token/issue",
        },
        "dataUrl": {
            "url": "/api/v1/nav/funds",
            "id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
        },
        ....
    }
}

You can extract the window.HSBC.dpas JavaScript object with a regex and then clean up the string (it contains trailing commas) so that it becomes valid JSON.
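The extraction step can be tried offline on a small sample. The sketch below is only an illustration, not the site's actual markup: SAMPLE_HTML is a made-up fragment with placeholder values, extract_dpas is a hypothetical helper name, and the comma-stripping regex is a simpler alternative to the one used in the full example. It assumes the object contains no semicolons and no string value with a comma directly before a closing brace or bracket.

```python
import json
import re

# Made-up page fragment shaped like the object above (placeholder values only).
SAMPLE_HTML = """
<script>
window.HSBC.dpas = {
    "pageInformation": {
        "country": "DE",
        "language": "DE",
        "tokenIssue": {
            "url": "/api/v1/token/issue",
        },
        "dataUrl": {
            "url": "/api/v1/nav/funds",
            "id": "PLACEHOLDER-ID"
        },
    }
};
</script>
"""


def extract_dpas(html):
    """Return the window.HSBC.dpas object from a page as a Python dict."""
    # Capture everything between "window.HSBC.dpas =" and the next semicolon.
    match = re.search(r"window\.HSBC\.dpas\s*=\s*([^;]*);", html)
    if match is None:
        raise ValueError("window.HSBC.dpas not found in page")
    # Drop commas that are followed only by whitespace and a closing } or ].
    cleaned = re.sub(r",(?=\s*[}\]])", "", match.group(1))
    return json.loads(cleaned)


if __name__ == "__main__":
    page_info = extract_dpas(SAMPLE_HTML)["pageInformation"]
    print(page_info["country"], page_info["dataUrl"]["id"])
```

On a real page the same function would be called with r.text from the GET request above, provided the assumptions listed here hold for that page.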

These values are then passed in the HTTP headers X-COUNTRY, X-COMPONENT and X-LANGUAGE to the following call:

POST https://www.assetmanagement.hsbc.de/api/v1/token/issue

It returns the JWT token directly as the response body. Add it to the request below in the Authorization header, as Authorization: Bearer {token}:

POST https://www.assetmanagement.hsbc.de/api/v1/nav/funds
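The header mapping for the two API calls can be sketched as two small helpers. The names token_headers and funds_headers are hypothetical, the values in the usage lines are placeholders, and the choice of dataUrl.id for the component header follows the example further down rather than any documented API.

```python
def token_headers(page_information):
    """Headers for the /token/issue call, built from pageInformation."""
    return {
        "X-Country": page_information["country"],
        "X-Component": page_information["dataUrl"]["id"],
        "X-Language": page_information["language"],
    }


def funds_headers(token, cache_header):
    """Headers for the /nav/funds call: cache header plus the bearer token."""
    return {
        "IFC-Cache-Header": cache_header,
        "Authorization": f"Bearer {token}",
    }


if __name__ == "__main__":
    # Placeholder values; real ones come from the page and from /token/issue.
    sample_info = {
        "country": "DE",
        "language": "DE",
        "dataUrl": {"url": "/api/v1/nav/funds", "id": "PLACEHOLDER-ID"},
    }
    print(token_headers(sample_info))
    print(funds_headers("PLACEHOLDER-TOKEN", "de,de,inst,documents,yes,1,n-1"))
```

Keeping the two header sets separate makes it easier to see which values come from the page and which come from the token call.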

Example:

import requests
import re
import json

api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url=f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"

# call the /fund-centre page to get the pageInformation values embedded in its javascript
# (the query string is passed via params only, so it is not sent twice)
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre"
r = requests.get(url, params={
    "f": "Yes",
    "n": "-1",
    "v": "Documents"
})
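# this gets the javascript object (everything between "window.HSBC.dpas =" and the next ";")
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
if res is None:
    raise ValueError("window.HSBC.dpas not found in the page")
group = res.group(1)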

# convert to valid JSON: remove trailing commas: https://stackoverflow.com/a/56595068 (added "e")
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group)

result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])

# call /token/issue API to get a token
r = requests.post(token_url, headers={
    "X-Country": result["pageInformation"]["country"],
    "X-Component": result["pageInformation"]["dataUrl"]["id"],
    "X-Language": result["pageInformation"]["language"]
}, data={})
token = r.text
print(token)

# call /nav/funds API
payload={
    "appliedFilters":[[{"active":True,"id":"Yes"}]],
    "paging":{"fundsPerPage":-1,"currentPage":1},
    "view":"Documents",
    "searchTerm":[],
    "selectedValues":[],
    "pageInformation": result["pageInformation"]
}
headers={
    "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
    "Authorization": f"Bearer {token}"
}
r = requests.post(funds_url,headers=headers,json=payload)
print(r.content)
