Reputation: 483
So I'm new to the world of web scraping, and so far I've only really been using BeautifulSoup to scrape text and images off websites. I thought I'd try to scrape some data points off a graph to test my understanding, but I got a bit confused by this graph.
After inspecting the element of the piece of data I wanted to extract, I saw this:
<span id="TSMAIN">: 100.7490637</span>
The problem is, my original idea for scraping the data points was to iterate through some sort of id list containing all the different data points (if that makes sense?).
Instead, it seems that all the data points are contained within this same element, and the value depends on where your cursor is on the graph.
My problem is, if I use BeautifulSoup's find function and search for that specific element with the attribute id="TSMAIN", I get a None return, I'm guessing because nothing shows up there unless my cursor is actually on the graph.
Code:
from bs4 import BeautifulSoup
import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"}
url = "https://www.morningstar.co.uk/uk/funds/snapshot/snapshot.aspx?id=F0GBR050AQ&tab=13"

source = requests.get(url, headers=headers)
soup = BeautifulSoup(source.content, 'lxml')
data = soup.find("span", attrs={"id": "TSMAIN"})
print(data)
Output
None
How can I extract all the data points of this graph?
Upvotes: 3
Views: 6162
Reputation: 28630
Seems like the data can be pulled from an API. The only catch is that the values it returns are relative to the start date entered in the payload: the output for the start date is set to 0, and the numbers after that are relative to that date (there's a short example of what that means after the output below).
import requests
import pandas as pd
from datetime import datetime
from dateutil import relativedelta

# Ask which trailing period to pull, then work the start date back from yesterday.
userInput = input('Choose:\n\t1. 3 Month\n\t2. 6 Month\n\t3. 1 Year\n\t4. 3 Year\n\t5. 5 Year\n\t6. 10 Year\n\n -->: ')
userDict = {'1': 3, '2': 6, '3': 12, '4': 36, '5': 60, '6': 120}

n = datetime.now()
n = n - relativedelta.relativedelta(days=1)
n = n - relativedelta.relativedelta(months=userDict[userInput])
dateStr = n.strftime('%Y-%m-%d')

url = 'https://tools.morningstar.co.uk/api/rest.svc/timeseries_cumulativereturn/t92wz0sj7c'
data = []

# Morningstar ids for each line on the chart.
idDict = {
    'Schroder Managed Balanced Instl Acc': 'F0GBR050AQ]2]0]FOGBR$$ALL',
    'GBP Moderately Adventurous Allocation': 'EUCA000916]8]0]CAALL$$ALL',
    'Mixed Investment 40-85% Shares': 'LC00000012]8]0]CAALL$$ALL',
    '': 'F00000ZOR1]7]0]IXALL$$ALL'}

for k, v in idDict.items():
    payload = {
        'currencyId': 'GBP',
        'idtype': 'Morningstar',
        'frequency': 'daily',
        'startDate': dateStr,
        'performanceType': '',
        'outputType': 'COMPACTJSON',
        'id': v,
        'decPlaces': '8',
        'applyTrackRecordExtension': 'false'}

    # COMPACTJSON comes back as a list of [epoch_ms, value] rows.
    temp_data = requests.get(url, params=payload).json()

    df = pd.DataFrame(temp_data)
    df['timestamp'] = pd.to_datetime(df[0], unit='ms')
    df['date'] = df['timestamp'].dt.date
    df = df[['date', 1]]
    df.columns = ['date', k]
    data.append(df)

# Align all series on date and plot them together.
final_df = pd.concat(
    (iDF.set_index('date') for iDF in data),
    axis=1, join='inner').reset_index()

final_df.plot(x="date", y=list(idDict.keys()), kind="line")
Output:
print (final_df.head(5).to_string())
date Schroder Managed Balanced Instl Acc GBP Moderately Adventurous Allocation Mixed Investment 40-85% Shares
0 2019-12-22 0.000000 0.000000 0.000000 0.000000
1 2019-12-23 0.357143 0.406784 0.431372 0.694508
2 2019-12-24 0.714286 0.616217 0.632422 0.667586
3 2019-12-25 0.714286 0.616217 0.632422 0.655917
4 2019-12-26 0.714286 0.612474 0.629152 0.664124
....
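For clarity on those numbers: each column is a cumulative percentage return since the start date (hence the row of zeros on 2019-12-22), and the unnamed fourth column of values belongs to the id stored under the empty-string key in idDict. If you'd rather have an index level than a return relative to the start date, a minimal sketch (assuming a base level of 100 on the start date; the ' (index)' column name is just illustrative) would be:

# Minimal sketch: rebase a cumulative % return column to an index level starting at 100.
col = 'Schroder Managed Balanced Instl Acc'
final_df[col + ' (index)'] = 100 * (1 + final_df[col] / 100)
print(final_df[['date', col, col + ' (index)']].head())

So a value of 0.714286 on 2019-12-24 comes out as an index level of roughly 100.71.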
To get those ids, it took a little investigating of the requests the page makes. Searching through those, I was able to find the corresponding id values, and with a little bit of trial and error work out what each value meant. Those are the "alternate" ids used, and they're where the line graphs get their data from: in those 4 requests, look at the Preview pane in the browser's dev tools and you'll see the data in there.
The final output/graph is the line chart drawn by the final_df.plot(...) call at the end of the script, one line per series in idDict.
Upvotes: 5