Reputation: 615
I want to scrape data from a specific section of a website; here's a screenshot of that section:
I inspected the elements of that section and noticed that it's rendered inside a canvas tag. However, I also checked the page source and found that the data is in there, in a format I'm not familiar with. Here's a sample of that data:
JSON.parse('\x5B\x7B\x22id\x22\x3A\x2232522\x22,\x22minute\x22\x3A\x2222\x22,\x22result\x22\x3A\x22MissedShots\x22,
\x22X\x22\x3A\x220.7859999847412109\x22,\x22Y\x22\x3A\x220.52\x22,\x22xG\x22\x3A\x220.03867039829492569\x22,
\x22player\x22\x3A\x22Lionel\x20Messi\x22,
\x22h_a\x22\x3A\x22h\x22,
\x22player_id\x22\x3A\x222097\x22,\x22situation\x22\x3A\x22OpenPlay\x22,
\x22season\x22\x3A\x222014\x22,\x22shotType\x22\x3A\x22LeftFoot\x22,
\x22match_id\x22\x3A...);
How do I parse through this data to give me the x,y co-ordinates of every shot from the map in the screenshot?
Upvotes: 4
Views: 1568
Reputation: 28640
Yes, the issue is with the encoding/decoding.
You can pull that string out of the page and then decode the \x escape sequences in it. Once you do that, you can use json.loads() to read it in and then navigate the JSON structure.
I only looked quickly and did not see the data that places each shot on the shot chart, but you can have a look to see if you can find it. The data does, however, have a shotZones key.
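To see the decoding step on its own, here is a minimal sketch that runs on a shortened sample in the same format as the string in the question. The sample is trimmed by hand for illustration (it is not the full page data), and the variable names are just placeholders.

```python
import codecs
import json

# Shortened, hand-trimmed sample in the same format as the page source:
# quotes, brackets, colons and spaces are written as \xNN escapes.
encoded = (
    r'\x5B\x7B\x22id\x22\x3A\x2232522\x22,'
    r'\x22X\x22\x3A\x220.7859999847412109\x22,\x22Y\x22\x3A\x220.52\x22,'
    r'\x22player\x22\x3A\x22Lionel\x20Messi\x22\x7D\x5D'
)

# 'unicode-escape' turns each \xNN sequence back into the character it encodes
decoded = codecs.decode(encoded, 'unicode-escape')
print(decoded)

# The decoded text is ordinary JSON, so json.loads can parse it
shots = json.loads(decoded)
print(shots[0]['player'], shots[0]['X'], shots[0]['Y'])
```

One caveat: if the raw string contains unescaped non-ASCII characters (accented player names, for example), 'unicode-escape' may garble them, so it is worth checking a few names after decoding.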
import requests
from bs4 import BeautifulSoup
import json
import codecs

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')

for script in scripts:
    if 'var groupsData = JSON.parse' in script.text:
        # Keep only the string literal passed to JSON.parse('...')
        encoded_string = script.text
        encoded_string = encoded_string.split("var groupsData = JSON.parse('")[-1]
        encoded_string = encoded_string.rsplit("'),", 1)[0]

        # Decode the \x escape sequences, then parse the result as JSON
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
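The split/rsplit trimming depends on exactly what surrounds the variable in the script. As an alternative (not what the code above does), a regular expression can capture just the argument of JSON.parse for a given variable name. This is a sketch only: `extract_json_var` is a hypothetical helper, `script_text` is a made-up stand-in for `script.text` (including the `otherData` variable), and it assumes the encoded string never contains a literal single quote.

```python
import codecs
import json
import re

# Made-up stand-in for script.text; the real page's layout may differ.
script_text = (
    "var groupsData = JSON.parse('\\x7B\\x22id\\x22\\x3A\\x222097\\x22\\x7D'),\n"
    "    otherData = JSON.parse('\\x5B\\x7B\\x22X\\x22\\x3A\\x220.86\\x22\\x7D\\x5D');"
)

def extract_json_var(text, var_name):
    """Return the parsed JSON passed to JSON.parse for var_name, or None."""
    # Capture everything between the quotes of JSON.parse('...').
    # Assumption: the encoded payload has no unescaped single quote.
    pattern = re.escape(var_name) + r"\s*=\s*JSON\.parse\('([^']*)'\)"
    match = re.search(pattern, text)
    if match is None:
        return None
    decoded = codecs.decode(match.group(1), 'unicode-escape')
    return json.loads(decoded)

groups_data = extract_json_var(script_text, 'groupsData')
other_data = extract_json_var(script_text, 'otherData')
print(groups_data)
print(other_data)
```

Returning None when the variable is missing makes it easy to loop over all script tags and keep the first one that yields data.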
Edit
Actually I found it. Here you go:
import requests
from bs4 import BeautifulSoup
import json
import codecs
import pandas as pd

url = 'https://understat.com/player/2097'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')

# I noticed the data was embedded in the script tag that started with `var shotsData`
for script in scripts:
    if 'var shotsData' in script.text:
        # I store that text, then trim off the string on the ends so that
        # it's in a valid json format
        encoded_string = script.text
        encoded_string = encoded_string.split("JSON.parse('", 1)[-1]
        encoded_string = encoded_string.rsplit("player_info =", 1)[0]
        encoded_string = encoded_string.rsplit("'),", 1)[0]

        # Decode the \x escape sequences so that json.loads can parse the string
        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)

# pd.json_normalize is the top-level function in pandas 1.0+; it replaces
# the older `from pandas.io.json import json_normalize`
df = pd.json_normalize(jsonObj)
Output:
print(df)
X ... xG
0 0.7859999847412109 ... 0.03867039829492569
1 0.8619999694824219 ... 0.06870150566101074
2 0.86 ... 0.15034306049346924
3 0.8180000305175781 ... 0.045503295958042145
4 0.8690000152587891 ... 0.06531666964292526
5 0.7230000305175781 ... 0.054804932326078415
6 0.9119999694824219 ... 0.0971858948469162
7 0.885 ... 0.11467907577753067
8 0.875999984741211 ... 0.10627452284097672
9 0.9540000152587891 ... 0.3100203275680542
10 0.8969999694824219 ... 0.12571729719638824
11 0.8959999847412109 ... 0.04122981056571007
12 0.8730000305175781 ... 0.09942527115345001
13 0.769000015258789 ... 0.025321772322058678
14 0.885 ... 0.7432776093482971
15 0.86 ... 0.4680374562740326
16 0.7619999694824219 ... 0.05699075385928154
17 0.919000015258789 ... 0.10647356510162354
18 0.9530000305175781 ... 0.571601390838623
19 0.8280000305175781 ... 0.07561512291431427
20 0.9030000305175782 ... 0.4600500166416168
21 0.9469999694824218 ... 0.3132372796535492
22 0.92 ... 0.2869703769683838
23 0.7659999847412109 ... 0.07576987147331238
24 0.9640000152587891 ... 0.3824153244495392
25 0.8590000152587891 ... 0.1282796859741211
26 0.9330000305175781 ... 0.42914989590644836
27 0.9230000305175782 ... 0.4968196153640747
28 0.8240000152587891 ... 0.08198583126068115
29 0.965999984741211 ... 0.4309735596179962
.. ... ... ...
843 0.9159999847412109 ... 0.4672183692455292
844 0.7430000305175781 ... 0.04068271815776825
845 0.815 ... 0.07300572842359543
846 0.8980000305175782 ... 0.06551901996135712
847 0.7680000305175781 ... 0.028392281383275986
848 0.885 ... 0.7432776093482971
849 0.875999984741211 ... 0.4060465097427368
850 0.7880000305175782 ... 0.09496577084064484
851 0.7190000152587891 ... 0.05071594566106796
852 0.7680000305175781 ... 0.090679831802845
853 0.7440000152587891 ... 0.06875557452440262
854 0.9069999694824219 ... 0.45824503898620605
855 0.850999984741211 ... 0.06454816460609436
856 0.935 ... 0.5926618576049805
857 0.9219999694824219 ... 0.16091874241828918
858 0.73 ... 0.05882067605853081
859 0.9080000305175782 ... 0.3522365391254425
860 0.8209999847412109 ... 0.1690768003463745
861 0.850999984741211 ... 0.11893663555383682
862 0.88 ... 0.11993970721960068
863 0.8119999694824219 ... 0.15579797327518463
864 0.7019999694824218 ... 0.011425728909671307
865 0.7530000305175781 ... 0.06945621967315674
866 0.850999984741211 ... 0.08273076266050339
867 0.8180000305175781 ... 0.06529481709003448
868 0.86 ... 0.10793478786945343
869 0.8190000152587891 ... 0.061923813074827194
870 0.8130000305175781 ... 0.05294585973024368
871 0.799000015258789 ... 0.06358513236045837
872 0.9019999694824219 ... 0.5841030478477478
[873 rows x 20 columns]
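To get just the x,y coordinates the question asks for, note that the values arrive as strings, so they need converting to floats. A minimal sketch without pandas, assuming each record has "X" and "Y" keys as in the sample: `shot_coordinates` is a hypothetical helper, the first record is trimmed from the question's sample, and the second record is made up for illustration.

```python
# First record is trimmed from the question's sample; the second is invented
# purely so the example has more than one row.
shots = [
    {"id": "32522", "minute": "22", "result": "MissedShots",
     "X": "0.7859999847412109", "Y": "0.52", "xG": "0.03867039829492569"},
    {"id": "0", "minute": "30", "result": "Goal",
     "X": "0.9", "Y": "0.45", "xG": "0.5"},
]

def shot_coordinates(records):
    """Return a list of (x, y) float pairs, one per shot record."""
    return [(float(r["X"]), float(r["Y"])) for r in records]

coords = shot_coordinates(shots)
for x, y in coords:
    print(f"{x:.3f}, {y:.3f}")
```

With the DataFrame above, the equivalent would be something like df[['X', 'Y']].astype(float). The values look like fractions between 0 and 1 rather than pixel positions, so they would presumably need scaling to your pitch dimensions before plotting.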
Upvotes: 3