Reputation: 25
I am trying to scrape some information from this website: https://www.nordnet.se/marknaden/aktiekurser?sortField=name&sortOrder=asc&exchangeCountry=SE&exchangeList=se%3Alargecapstockholmsek.
What I want to do is grab the sector information for each company, which is provided under the "Om bolaget" tab on the company-specific pages. More specifically, the information I want is in the "Sektor" and "Branch" fields. The links to the company-specific pages can easily be obtained with requests and BeautifulSoup in Python.
When making a GET request to these links, the response sometimes contains the wanted information in the form "sector: ..." and "sector_group: ...", but not always. One example where it works is Latour, https://www.nordnet.se/marknaden/aktiekurser/16099736-latour-investmentab-b, and one example where it doesn't work is EQT, https://www.nordnet.se/marknaden/aktiekurser/17117956-eqt.
Note that I see that an XHR-request (POST-request) is being made when pressing "Om bolaget", but I am not sure how to exploit it.
The code I use to grab the sector information from a company-specific page is provided below:
import json
import re

import requests
from bs4 import BeautifulSoup


def get_sector(url):
    sector, sector_group = None, None
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    tags = soup.find_all('script')
    for tag in tags:
        content = tag.get_text()
        # Strip the backslashes used to escape the embedded JSON
        content = content.replace('\\', '')
        # Only the script holding the page's initial state is of interest
        if '__initialState__' not in content:
            continue
        try:
            sector = re.findall(r'"sector":"\w+"', content)[0]
            sector = json.loads('{' + sector + '}')
            sector = sector['sector']
        except IndexError:
            print(url)
            print('Sector not found')
        try:
            sector_group = re.findall(r'"sector_group":"\w+"', content)[0]
            sector_group = json.loads('{' + sector_group + '}')
            sector_group = sector_group['sector_group']
        except IndexError:
            print('Sector Group not found')
        break
    return sector, sector_group
Any input would be much appreciated.
Upvotes: 2
Views: 325
Reputation: 12255
To get the "Om bolaget" batch, you have to take the ntag from the response headers of https://www.nordnet.se/api/2/login/anonymous. You can fetch it once and reuse it in later requests. The best way to do that is with requests.session(). In data, the company IDs (17117956 and 16099736) should be variables:
import requests

headers = {
    'Connection': 'keep-alive',
    'Content-Length': '0',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Origin': 'https://www.nordnet.se',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'ntag': 'NO_NTAG_RECEIVED_YET',
    'content-type': 'application/x-www-form-urlencoded',
    'accept': 'application/json',
    'client-id': 'NEXT',
    'DNT': '1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.nordnet.se/se',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
}

with requests.session() as s:
    # Anonymous login: the ntag comes back in the response headers
    r = s.post('https://www.nordnet.se/api/2/login/anonymous', headers=headers)
    headers['ntag'] = r.headers['ntag']
    headers['content-type'] = 'application/json'
    headers['accept'] = 'application/json'

    # Reuse the same session and ntag for every company
    for company_id in ['17117956', '16099736']:
        data = '{"batch":"[{\\"relative_url\\":\\"company_data/keyfigures/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/yearlyfinancial/' + company_id + '\\",\\"method\\":\\"GET\\"},{\\"relative_url\\":\\"company_data/summary/' + company_id + '\\",\\"method\\":\\"GET\\"}]"}'
        r = s.post('https://www.nordnet.se/api/2/batch', headers=headers, data=data)
        print(r.text)
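If the hand-escaped data string is awkward to maintain, the same body can probably be built with json.dumps, and the sector fields pulled out of the parsed response with a small recursive search. This is only a sketch: build_batch_payload and find_values are hypothetical helper names, json.dumps adds spaces after separators so the body is not byte-identical to the string above, and the shape of the batch response is an assumption (the key names "sector" and "sector_group" come from the question, and nested bodies are assumed to possibly arrive as JSON-encoded strings).

```python
import json


def build_batch_payload(company_id):
    """Build the body for /api/2/batch; the "batch" value is itself a JSON string."""
    batch = [
        {"relative_url": f"company_data/{part}/{company_id}", "method": "GET"}
        for part in ("keyfigures", "yearlyfinancial", "summary")
    ]
    return json.dumps({"batch": json.dumps(batch)})


def find_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    elif isinstance(node, str) and node[:1] in "{[":
        # Assumption: nested response bodies may be JSON-encoded strings.
        try:
            found.extend(find_values(json.loads(node), key))
        except ValueError:
            pass
    return found
```

Inside the loop above you could then pass data=build_batch_payload(company_id) and, instead of printing r.text, call find_values(r.json(), 'sector') and find_values(r.json(), 'sector_group'). An empty list means the key was not present in that response.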
Upvotes: 1