Reputation: 37
I started scraping this website, https://www.flashscore.com/, with Selenium, but it is a very slow process because I have to scrape thousands of URLs, so I looked for a faster method with Requests:
import requests
from bs4 import BeautifulSoup
import json
import re
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script',{'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
headers.update({'x-fsign':fsign})
url = "https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB"
response = requests.get(url, headers=headers)
print(response.status_code)
print(response.text.strip())
Output
SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~SG÷Shots on Goal¬SH÷4¬SI÷3¬~SG÷Shots off Goal¬SH÷3¬SI÷9¬~SG÷Blocked Shots¬SH÷3¬SI÷8¬~SG÷Free Kicks¬SH÷8¬SI÷11¬~SG÷Corner Kicks¬SH÷6¬SI÷7¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷15¬SI÷15¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷4¬~SG÷Fouls¬SH÷10¬SI÷7¬~SG÷Total Passes¬SH÷389¬SI÷584¬~SG÷Tackles¬SH÷14¬SI÷15¬~SG÷Attacks¬SH÷107¬SI÷105¬~SG÷Dangerous Attacks¬SH÷77¬SI÷53¬~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~SG÷Goal Attempts¬SH÷5¬SI÷13¬~SG÷Shots on Goal¬SH÷1¬SI÷1¬~SG÷Shots off Goal¬SH÷2¬SI÷7¬~SG÷Blocked Shots¬SH÷2¬SI÷5¬~SG÷Free Kicks¬SH÷2¬SI÷5¬~SG÷Corner Kicks¬SH÷2¬SI÷2¬~SG÷Offsides¬SH÷0¬SI÷0¬~SG÷Throw-in¬SH÷10¬SI÷9¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷1¬~SG÷Fouls¬SH÷5¬SI÷2¬~SG÷Total Passes¬SH÷188¬SI÷331¬~SG÷Tackles¬SH÷8¬SI÷10¬~SG÷Attacks¬SH÷50¬SI÷61¬~SG÷Dangerous Attacks¬SH÷35¬SI÷29¬~SE÷2nd Half¬~SG÷Ball Possession¬SH÷45%¬SI÷55%¬~SG÷Goal Attempts¬SH÷5¬SI÷7¬~SG÷Shots on Goal¬SH÷3¬SI÷2¬~SG÷Shots off Goal¬SH÷1¬SI÷2¬~SG÷Blocked Shots¬SH÷1¬SI÷3¬~SG÷Free Kicks¬SH÷6¬SI÷6¬~SG÷Corner Kicks¬SH÷4¬SI÷5¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷5¬SI÷6¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷3¬~SG÷Fouls¬SH÷5¬SI÷5¬~SG÷Total Passes¬SH÷201¬SI÷253¬~SG÷Tackles¬SH÷6¬SI÷5¬~SG÷Attacks¬SH÷57¬SI÷44¬~SG÷Dangerous Attacks¬SH÷42¬SI÷24¬~A1÷¬~
With this code I can access a URL that returns the stats in a particular format, but how can I scrape the data from that response and get other details like teams, scores and datetime?
The URLs that are going to be scraped look like this: https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary
After making some replacements based on the patterns, I get:
Match
Stat Ball Possession Home 41% Away 59%
Stat Goal Attempts Home 10 Away 20
Stat Shots on Goal Home 4 Away 3
Stat Shots off Goal Home 3 Away 9
Stat Blocked Shots Home 3 Away 8
Stat Free Kicks Home 8 Away 11
Stat Corner Kicks Home 6 Away 7
Stat Offsides Home 1 Away 1
Stat Throw-in Home 15 Away 15
Stat Goalkeeper Saves Home 0 Away 4
Stat Fouls Home 10 Away 7
Stat Total Passes Home 389 Away 584
Stat Tackles Home 14 Away 15
Stat Attacks Home 107 Away 105
Stat Dangerous Attacks Home 77 Away 53
1st Half
Stat Ball Possession Home 37% Away 63%
Stat Goal Attempts Home 5 Away 13
Stat Shots on Goal Home 1 Away 1
Stat Shots off Goal Home 2 Away 7
Stat Blocked Shots Home 2 Away 5
Stat Free Kicks Home 2 Away 5
Stat Corner Kicks Home 2 Away 2
Stat Offsides Home 0 Away 0
Stat Throw-in Home 10 Away 9
Stat Goalkeeper Saves Home 0 Away 1
Stat Fouls Home 5 Away 2
Stat Total Passes Home 188 Away 331
Stat Tackles Home 8 Away 10
Stat Attacks Home 50 Away 61
Stat Dangerous Attacks Home 35 Away 29
2nd Half
Stat Ball Possession Home 45% Away 55%
Stat Goal Attempts Home 5 Away 7
Stat Shots on Goal Home 3 Away 2
Stat Shots off Goal Home 1 Away 2
Stat Blocked Shots Home 1 Away 3
Stat Free Kicks Home 6 Away 6
Stat Corner Kicks Home 4 Away 5
Stat Offsides Home 1 Away 1
Stat Throw-in Home 5 Away 6
Stat Goalkeeper Saves Home 0 Away 3
Stat Fouls Home 5 Away 5
Stat Total Passes Home 201 Away 253
Stat Tackles Home 6 Away 5
Stat Attacks Home 57 Away 44
Stat Dangerous Attacks Home 42 Away 24
Upvotes: 2
Views: 960
Reputation: 142641
When I observe the Network tab in DevTools (with the XHR filter, and with addresses filtered by the text `_1_`), I see other values at different URLs:
Match Summary
Statistics
Formation & Starting Lineups
Comments
All of them give strange strings, but I see a pattern in them:
`~` means a new line
`¬` separates the items in a line
`÷` splits an item into name and value
If I use this to reformat the data, it becomes more readable, but it still needs to be organized into lists and dictionaries, and every URL will need its own code for that, so I skip this part.
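As a rough sketch of the three splitting rules above (this is not part of the original code), a generic parser could turn any of these feed strings into a list of records, where each record is a list of (name, value) pairs. Pairs are used rather than a dictionary because a name such as IE appears to occur more than once within one line of the Match Summary feed. The function name `parse_feed` and the shortened sample string are made up for illustration.

```python
def parse_feed(text):
    """Split a feed string into records.

    Each record corresponds to one '~'-separated line and is a list of
    (name, value) pairs taken from the '¬'-separated items.
    """
    records = []
    for line in text.strip().split('~'):
        pairs = []
        for item in line.split('¬'):
            if not item:
                continue  # skip the empty item left by a trailing '¬'
            # '÷' separates the name from the value
            name, _, value = item.partition('÷')
            pairs.append((name, value))
        if pairs:
            records.append(pairs)
    return records


# Shortened sample in the same format as the statistics feed
sample = 'SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~A1÷¬~'
for record in parse_feed(sample):
    print(record)
```

Each feed would still need its own code to interpret the names (SE, SG, SH and so on), but the splitting step itself can be shared.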
For Statistics
Minimal working code:
import requests
from bs4 import BeautifulSoup
import json
import re
def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
s = requests.Session()
s.headers.update(headers)
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
s.headers.update({'x-fsign':fsign})
print('--- Match Summary ---')
url = 'https://d.flashscore.com/x/feed/dc_1_tE4RoHzB'
response = s.get(url)
display(response.text)
url = 'https://d.flashscore.com/x/feed/df_sui_1_tE4RoHzB'
response = s.get(url)
display(response.text)
url = 'https://d.flashscore.com/x/feed/df_dos_1_tE4RoHzB_'
response = s.get(url)
display(response.text)
print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)
print('--- Formation & Starting Lineups ---')
url = 'https://d.flashscore.com/x/feed/df_scr_1_tE4RoHzB' # OK
response = s.get(url)
display(response.text)
url = 'https://d.flashscore.com/x/feed/df_li_1_tE4RoHzB' # OK
response = s.get(url)
display(response.text)
print('--- Comments ---')
url = "https://d.flashscore.com/x/feed/df_lc_1_tE4RoHzB" # OK
response = s.get(url, headers={'x-fsign':fsign})
display(response.text)
Result (partial):
--- Match Summary ---
> AC|1st Half
> IG|0
> IH|1
>
> III|nTADt7er
> IA|2
> IB|43'
> IE|3
> IF|Firmino R.
> IU|/player/firmino-roberto/CCNtplZe/
> ICT|Goal! Andrew Robertson tees up Roberto Firmino<br />(Liverpool) inside the box, and he keeps<br />his cool to find the bottom right corner.<br />0:1.
> IK|Goal
> IM|CCNtplZe
> IN|452994
> IO|
> IE|8
> IF|Robertson A.
> IU|/player/robertson-andrew/6e7Be9VI/
> ICT|
> IK|Assistance
> IM|6e7Be9VI
--- Statictics ---
> SE|Match
>
> SG|Ball Possession
> SH|41%
> SI|59%
>
> SG|Goal Attempts
> SH|10
> SI|20
>
> SG|Shots on Goal
> SH|4
> SI|3
>
> SG|Shots off Goal
> SH|3
> SI|9
>
> SG|Blocked Shots
> SH|3
> SI|8
>
--- Formation & Starting Lineups ---
> SPT|1
> SPI|bDmUiRg3
> SPF|199
> SPG|Scotland
> SPR|/player/bardsley-phillip/bDmUiRg3/
> SPN|Bardsley P.
> SPC|1
> SPU|0
> SPE|Hernia
> SPD|There is some chance of playing.
>
> SPT|1
> SPI|GjF2pGwT
> SPF|96
> SPG|Ireland
> SPR|/player/brady-robbie/GjF2pGwT/
> SPN|Brady R.
> SPC|1
> SPU|0
> SPE|Calf Injury
> SPD|There is some chance of playing.
>
--- Comments ---
> MA|
>
> MB|90+4'
> MK|90:00 +3:29
> MC|whistle
> MD|There will be no more action in this match as the referee signals full time.
> MF|1
> MG|740
> MH|https://media-content-enetpulse.secure.footprint.net/gallery/2021/5/19/7df4d412740f558c20283d63a180c964o2.jpg
>
> MB|90+4'
> MK|90:00 +3:20
> MC|corner
> MD|Liverpool failed to take advantage of the corner as the opposition's defence was alert and averted the threat. Liverpool are still threatening though, as it's a corner.
>
> MB|90+3'
> MK|90:00 +2:50
> MC|
> MD|Mohamed Salah (Liverpool) skips past his man but can't keep the ball in play. Liverpool earn a corner.
> ME|1
> MF|1
>
EDIT:
A version that formats the Statistics feed as a pandas DataFrame:
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))

def format_statisctic(text):
    text = text.strip()
    data = []
    row = []
    match_part = ''  # to remember if it is full match or 1st/2nd half
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            # remember which part of the match the following stats belong to
            if parts[0] == 'SE':
                match_part = parts[1]
            # create row with data: stat name (SG), home value (SH), away value (SI)
            if parts[0] in ('SG', 'SH', 'SI'):
                row.append(parts[1])
            # add row to data with `match_part`
            if len(row) == 3:
                data.append([match_part] + row)
                # empty row for new data
                row = []
    # convert all to DataFrame
    df = pd.DataFrame(data, columns=['Part', 'Stat', 'SH', 'SI'])
    print(df)
# -------------------------
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
s = requests.Session()
s.headers.update(headers)
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)
        fsign = jsonData['config']['app']['feed_sign']
s.headers.update({'x-fsign':fsign})
print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)
text_statictics = response.text
format_statisctic(text_statictics)
Result:
        Part               Stat   SH   SI
0      Match    Ball Possession  41%  59%
1      Match      Goal Attempts   10   20
2      Match      Shots on Goal    4    3
3      Match     Shots off Goal    3    9
4      Match      Blocked Shots    3    8
5      Match         Free Kicks    8   11
6      Match       Corner Kicks    6    7
7      Match           Offsides    1    1
8      Match           Throw-in   15   15
9      Match   Goalkeeper Saves    0    4
10     Match              Fouls   10    7
11     Match       Total Passes  389  584
12     Match            Tackles   14   15
13     Match            Attacks  107  105
14     Match  Dangerous Attacks   77   53
15  1st Half    Ball Possession  37%  63%
16  1st Half      Goal Attempts    5   13
17  1st Half      Shots on Goal    1    1
18  1st Half     Shots off Goal    2    7
19  1st Half      Blocked Shots    2    5
20  1st Half         Free Kicks    2    5
21  1st Half       Corner Kicks    2    2
22  1st Half           Offsides    0    0
23  1st Half           Throw-in   10    9
24  1st Half   Goalkeeper Saves    0    1
25  1st Half              Fouls    5    2
26  1st Half       Total Passes  188  331
27  1st Half            Tackles    8   10
28  1st Half            Attacks   50   61
29  1st Half  Dangerous Attacks   35   29
30  2nd Half    Ball Possession  45%  55%
31  2nd Half      Goal Attempts    5    7
32  2nd Half      Shots on Goal    3    2
33  2nd Half     Shots off Goal    1    2
34  2nd Half      Blocked Shots    1    3
35  2nd Half         Free Kicks    6    6
36  2nd Half       Corner Kicks    4    5
37  2nd Half           Offsides    1    1
38  2nd Half           Throw-in    5    6
39  2nd Half   Goalkeeper Saves    0    3
40  2nd Half              Fouls    5    5
41  2nd Half       Total Passes  201  253
42  2nd Half            Tackles    6    5
43  2nd Half            Attacks   57   44
44  2nd Half  Dangerous Attacks   42   24
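Since the question mentions thousands of match URLs, here is a small sketch of how the match ID might be taken from a match URL and turned into the statistics feed address. It assumes every match follows the same `df_st_1_<id>` pattern seen for `tE4RoHzB`, which this post only shows for a single match, so treat it as a guess to verify. The helper names `match_id_from_url` and `statistics_feed_url` are hypothetical.

```python
import re

# Base address of the feeds, as seen in DevTools for the example match
FEED_BASE = 'https://d.flashscore.com/x/feed/'


def match_id_from_url(url):
    """Return the ID that follows '/match/' in a match URL, or None."""
    found = re.search(r'/match/([^/#?]+)', url)
    return found.group(1) if found else None


def statistics_feed_url(match_id):
    """Build the statistics feed URL (assumed pattern: df_st_1_<id>)."""
    return f'{FEED_BASE}df_st_1_{match_id}'


page_url = 'https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary'
match_id = match_id_from_url(page_url)
print(match_id)
print(statistics_feed_url(match_id))
```

The resulting URL would then be requested with the same session and `x-fsign` header as in the code above.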
Upvotes: 3