Reputation: 21
I was trying to scrape this website but wasn't getting the table data. I even copied the request data from the Chrome dev tools, but I can't figure out what I'm doing wrong.
Here is my script:
import requests,json
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
r = requests.post(url,headers=headers,data=payload)
print(r.content)
Upvotes: 1
Views: 279
Reputation: 45493
Your request is missing the IFC-Cache-Header HTTP header, but there is also a JWT token that must be passed via the Authorization header.
To retrieve this token, you first need to extract some values from the root page:
GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre
which features the following JavaScript object:
window.HSBC.dpas = {
"pageInformation": {
"country": "X", <========= HERE
"language": "X", <========= HERE
"tokenIssue": {
"url": "/api/v1/token/issue",
},
"dataUrl": {
"url": "/api/v1/nav/funds",
"id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
},
....
}
}
You can extract the window.HSBC.dpas JavaScript object using a regex and then clean up the string (it contains trailing commas) so that it becomes valid JSON.
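As a rough, offline illustration of that extraction step, here is a minimal sketch run against a made-up page snippet. The `sample_html` string, the `PLACEHOLDER_ID` value and the `extract_dpas` helper are all hypothetical, and the trailing-comma pattern is a simpler lookahead than the one used in the full example further down. It assumes the object literal ends with `};` and that no string value contains a comma directly before a closing bracket.

```python
import json
import re

# Hypothetical page snippet; the real page's object has more fields.
sample_html = '''
<script>
window.HSBC = window.HSBC || {};
window.HSBC.dpas = {
  "pageInformation": {
    "country": "DE",
    "language": "DE",
    "tokenIssue": {
      "url": "/api/v1/token/issue",
    },
    "dataUrl": {
      "url": "/api/v1/nav/funds",
      "id": "PLACEHOLDER_ID"
    },
    "forceHttps": true,
  }
};
</script>
'''


def extract_dpas(html):
    """Return the window.HSBC.dpas object from the page source as a dict."""
    # Capture from the first "{" after the assignment up to the "}" that
    # is followed by a semicolon (assumed to close the object literal).
    match = re.search(r"window\.HSBC\.dpas\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if match is None:
        raise ValueError("window.HSBC.dpas not found in page source")
    raw = match.group(1)
    # Drop commas that are followed only by whitespace and a closing
    # brace or bracket; JSON does not allow trailing commas.
    cleaned = re.sub(r",(?=\s*[}\]])", "", raw)
    return json.loads(cleaned)


page_info = extract_dpas(sample_html)["pageInformation"]
print(page_info["country"], page_info["dataUrl"]["id"])
```

Raising on a missing match gives a clearer error than the `AttributeError` you would otherwise get from calling `.group()` on `None` if the page layout changes.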
These values are then passed in the HTTP headers X-COUNTRY, X-COMPONENT (the dataUrl id) and X-LANGUAGE to the following call:
POST https://www.assetmanagement.hsbc.de/api/v1/token/issue
It returns the JWT token directly in the response body. Add it to the request as Authorization: Bearer {token} when calling:
POST https://www.assetmanagement.hsbc.de/api/v1/nav/funds
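To make the header mapping explicit, the two header sets can be sketched as small pure functions. The function names and the sample values are hypothetical, and stripping whitespace from the token is a defensive assumption on my part, not something the API is known to require.

```python
def token_request_headers(page_info):
    """Headers for the /token/issue call, built from pageInformation."""
    return {
        "X-Country": page_info["country"],
        "X-Component": page_info["dataUrl"]["id"],
        "X-Language": page_info["language"],
    }


def funds_request_headers(token, cache_header):
    """Headers for the /nav/funds call."""
    return {
        "IFC-Cache-Header": cache_header,
        # Assumption: strip stray whitespace around the returned token.
        "Authorization": f"Bearer {token.strip()}",
    }


# Placeholder values for illustration only.
page_info = {
    "country": "DE",
    "language": "DE",
    "dataUrl": {"url": "/api/v1/nav/funds", "id": "PLACEHOLDER_ID"},
}
print(token_request_headers(page_info))
print(funds_request_headers("abc.def.ghi\n", "de,de,inst,documents,yes,1,n-1"))
```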
Example:
import requests
import re
import json
api_url = "https://www.assetmanagement.hsbc.de/api/v1"
funds_url=f"{api_url}/nav/funds"
token_url = f"{api_url}/token/issue"
# call the /fund-centre url to get the documentID value in the javascript
url = "https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre"
# the query string is supplied via params only, so it is not duplicated in the URL
r = requests.get(url,
    params = {
        "f":"Yes",
        "n": "-1",
        "v": "Documents"
    })
# this gets the javascript object
res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
group = res.group(1)
# convert to valid JSON: remove trailing commas: https://stackoverflow.com/a/56595068 (added "e")
regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
result_json = re.sub(regex, "", group)
result = json.loads(result_json)
print(result["pageInformation"]["dataUrl"])
# call /token/issue API to get a token
r = requests.post(token_url,
    headers={
        "X-Country": result["pageInformation"]["country"],
        "X-Component": result["pageInformation"]["dataUrl"]["id"],
        "X-Language": result["pageInformation"]["language"]
    }, data={})
token = r.text
print(token)
# call /nav/funds API
payload={
"appliedFilters":[[{"active":True,"id":"Yes"}]],
"paging":{"fundsPerPage":-1,"currentPage":1},
"view":"Documents",
"searchTerm":[],
"selectedValues":[],
"pageInformation": result["pageInformation"]
}
headers={
"IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
"Authorization": f"Bearer {token}"
}
r = requests.post(funds_url,headers=headers,json=payload)
print(r.content)
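Note that the final call uses json=payload where the question used data=payload. With data=, requests form-encodes the dict, which cannot represent nested objects; with json=, it serializes the payload as JSON and sets the Content-Type to application/json. The sketch below approximates the two bodies with the standard library only (urlencode stands in for what requests does with data=; the trimmed-down payload is just for illustration):

```python
import json
from urllib.parse import urlencode

# Trimmed-down payload for illustration.
payload = {
    "paging": {"fundsPerPage": -1, "currentPage": 1},
    "view": "Documents",
}

# Approximation of data=payload: form encoding iterates over the nested
# dict, so only its keys survive and the values -1 and 1 are lost.
form_body = urlencode(payload, doseq=True)

# Equivalent of json=payload: the nested structure is preserved.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

Comparing the two printed bodies shows why the server could not read the paging and filter settings from the original request.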
Upvotes: 2