gromme12

Reputation: 37

Scraping Dynamic page with Requests

I started scraping this website, https://www.flashscore.com/, with Selenium, but it is a very slow process because I have to scrape thousands of URLs, so I looked for a faster method using Requests.

import requests
from bs4 import BeautifulSoup
import json
import re


url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script',{'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)

fsign = jsonData['config']['app']['feed_sign']
headers.update({'x-fsign':fsign})
url = "https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB"

response = requests.get(url, headers=headers)

print(response.status_code)
print(response.text.strip())

Output

SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~SG÷Goal Attempts¬SH÷10¬SI÷20¬~SG÷Shots on Goal¬SH÷4¬SI÷3¬~SG÷Shots off Goal¬SH÷3¬SI÷9¬~SG÷Blocked Shots¬SH÷3¬SI÷8¬~SG÷Free Kicks¬SH÷8¬SI÷11¬~SG÷Corner Kicks¬SH÷6¬SI÷7¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷15¬SI÷15¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷4¬~SG÷Fouls¬SH÷10¬SI÷7¬~SG÷Total Passes¬SH÷389¬SI÷584¬~SG÷Tackles¬SH÷14¬SI÷15¬~SG÷Attacks¬SH÷107¬SI÷105¬~SG÷Dangerous Attacks¬SH÷77¬SI÷53¬~SE÷1st Half¬~SG÷Ball Possession¬SH÷37%¬SI÷63%¬~SG÷Goal Attempts¬SH÷5¬SI÷13¬~SG÷Shots on Goal¬SH÷1¬SI÷1¬~SG÷Shots off Goal¬SH÷2¬SI÷7¬~SG÷Blocked Shots¬SH÷2¬SI÷5¬~SG÷Free Kicks¬SH÷2¬SI÷5¬~SG÷Corner Kicks¬SH÷2¬SI÷2¬~SG÷Offsides¬SH÷0¬SI÷0¬~SG÷Throw-in¬SH÷10¬SI÷9¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷1¬~SG÷Fouls¬SH÷5¬SI÷2¬~SG÷Total Passes¬SH÷188¬SI÷331¬~SG÷Tackles¬SH÷8¬SI÷10¬~SG÷Attacks¬SH÷50¬SI÷61¬~SG÷Dangerous Attacks¬SH÷35¬SI÷29¬~SE÷2nd Half¬~SG÷Ball Possession¬SH÷45%¬SI÷55%¬~SG÷Goal Attempts¬SH÷5¬SI÷7¬~SG÷Shots on Goal¬SH÷3¬SI÷2¬~SG÷Shots off Goal¬SH÷1¬SI÷2¬~SG÷Blocked Shots¬SH÷1¬SI÷3¬~SG÷Free Kicks¬SH÷6¬SI÷6¬~SG÷Corner Kicks¬SH÷4¬SI÷5¬~SG÷Offsides¬SH÷1¬SI÷1¬~SG÷Throw-in¬SH÷5¬SI÷6¬~SG÷Goalkeeper Saves¬SH÷0¬SI÷3¬~SG÷Fouls¬SH÷5¬SI÷5¬~SG÷Total Passes¬SH÷201¬SI÷253¬~SG÷Tackles¬SH÷6¬SI÷5¬~SG÷Attacks¬SH÷57¬SI÷44¬~SG÷Dangerous Attacks¬SH÷42¬SI÷24¬~A1÷¬~

With this code I can access a URL that returns the stats in a particular format, but how can I extract the data from that response and also get other information such as teams, scores and datetime?

The URLs that are going to be scraped look like this: https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary

After making some replacements based on the patterns, the output looks like this:

Match
Stat Ball Possession Home 41% Away 59%
Stat Goal Attempts Home 10 Away 20
Stat Shots on Goal Home 4 Away 3
Stat Shots off Goal Home 3 Away 9
Stat Blocked Shots Home 3 Away 8
Stat Free Kicks Home 8 Away 11
Stat Corner Kicks Home 6 Away 7
Stat Offsides Home 1 Away 1
Stat Throw-in Home 15 Away 15
Stat Goalkeeper Saves Home 0 Away 4
Stat Fouls Home 10 Away 7
Stat Total Passes Home 389 Away 584
Stat Tackles Home 14 Away 15
Stat Attacks Home 107 Away 105
Stat Dangerous Attacks Home 77 Away 53
1st Half
Stat Ball Possession Home 37% Away 63%
Stat Goal Attempts Home 5 Away 13
Stat Shots on Goal Home 1 Away 1
Stat Shots off Goal Home 2 Away 7
Stat Blocked Shots Home 2 Away 5
Stat Free Kicks Home 2 Away 5
Stat Corner Kicks Home 2 Away 2
Stat Offsides Home 0 Away 0
Stat Throw-in Home 10 Away 9
Stat Goalkeeper Saves Home 0 Away 1
Stat Fouls Home 5 Away 2
Stat Total Passes Home 188 Away 331
Stat Tackles Home 8 Away 10
Stat Attacks Home 50 Away 61
Stat Dangerous Attacks Home 35 Away 29
2nd Half
Stat Ball Possession Home 45% Away 55%
Stat Goal Attempts Home 5 Away 7
Stat Shots on Goal Home 3 Away 2
Stat Shots off Goal Home 1 Away 2
Stat Blocked Shots Home 1 Away 3
Stat Free Kicks Home 6 Away 6
Stat Corner Kicks Home 4 Away 5
Stat Offsides Home 1 Away 1
Stat Throw-in Home 5 Away 6
Stat Goalkeeper Saves Home 0 Away 3
Stat Fouls Home 5 Away 5
Stat Total Passes Home 201 Away 253
Stat Tackles Home 6 Away 5
Stat Attacks Home 57 Away 44
Stat Dangerous Attacks Home 42 Away 24

Upvotes: 2

Views: 960

Answers (1)

furas

Reputation: 142641

When I observe the Network tab in DevTools (filtered to XHR, with addresses filtered by the text _1_),
I see other values at different URLs:

Match Summary

Statistics

Formation & Starting Lineups

Comments

All of them give strange strings, but I see a pattern in them:

~ starts a new line
¬ separates the items in a line
÷ splits an item into name and value

If I use this to reformat the data, it looks more readable, but it still needs to be organized into lists and dictionaries, and every URL will need its own code for this, so I skip this part.
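
As a possible starting point for that organizing step, here is a minimal sketch that applies the three separators and returns plain Python structures instead of printing. The name `parse_feed` is made up for this example. Each line is kept as a list of (name, value) pairs rather than a dictionary, because the output further down suggests that a name such as IE can occur more than once in the same line.

```python
def parse_feed(text):
    """Split a feed string into records, one list of (name, value) pairs per '~' line."""
    records = []
    for line in text.strip().split('~'):
        pairs = []
        for item in line.split('¬'):
            if not item:
                # every line ends with '¬', which leaves an empty trailing item
                continue
            name, _, value = item.partition('÷')
            pairs.append((name, value))
        if pairs:
            records.append(pairs)
    return records


# short fragment copied from the statistics output in the question
sample = 'SE÷Match¬~SG÷Ball Possession¬SH÷41%¬SI÷59%¬~A1÷¬~'

for record in parse_feed(sample):
    print(record)
```

What the individual names mean still has to be worked out per feed, so this only covers the generic splitting.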

For Statistics


Minimal working code:

import requests
from bs4 import BeautifulSoup
import json
import re

def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))
        
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

s = requests.Session()
s.headers.update(headers)

response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)


fsign = jsonData['config']['app']['feed_sign']

s.headers.update({'x-fsign':fsign})
                 
                 
print('--- Match Summary ---')
url = 'https://d.flashscore.com/x/feed/dc_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_sui_1_tE4RoHzB'
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_dos_1_tE4RoHzB_'
response = s.get(url)
display(response.text)

print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)

print('--- Formation & Starting Lineups ---')
url = 'https://d.flashscore.com/x/feed/df_scr_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

url = 'https://d.flashscore.com/x/feed/df_li_1_tE4RoHzB'  # OK
response = s.get(url)
display(response.text)

print('--- Comments ---')
url = "https://d.flashscore.com/x/feed/df_lc_1_tE4RoHzB"  # OK
response = s.get(url, headers={'x-fsign':fsign})
display(response.text)

Result (partially)

--- Match Summary ---

> AC|1st Half
> IG|0
> IH|1
> 
> III|nTADt7er
> IA|2
> IB|43'
> IE|3
> IF|Firmino R.
> IU|/player/firmino-roberto/CCNtplZe/
> ICT|Goal! Andrew Robertson tees up Roberto Firmino<br />(Liverpool) inside the box, and he keeps<br />his cool to find the bottom right corner.<br />0:1.
> IK|Goal
> IM|CCNtplZe
> IN|452994
> IO|
> IE|8
> IF|Robertson A.
> IU|/player/robertson-andrew/6e7Be9VI/
> ICT|
> IK|Assistance
> IM|6e7Be9VI

--- Statictics ---
> SE|Match
> 
> SG|Ball Possession
> SH|41%
> SI|59%
> 
> SG|Goal Attempts
> SH|10
> SI|20
> 
> SG|Shots on Goal
> SH|4
> SI|3
> 
> SG|Shots off Goal
> SH|3
> SI|9
> 
> SG|Blocked Shots
> SH|3
> SI|8
> 

--- Formation & Starting Lineups ---
> SPT|1
> SPI|bDmUiRg3
> SPF|199
> SPG|Scotland
> SPR|/player/bardsley-phillip/bDmUiRg3/
> SPN|Bardsley P.
> SPC|1
> SPU|0
> SPE|Hernia
> SPD|There is some chance of playing.
> 
> SPT|1
> SPI|GjF2pGwT
> SPF|96
> SPG|Ireland
> SPR|/player/brady-robbie/GjF2pGwT/
> SPN|Brady R.
> SPC|1
> SPU|0
> SPE|Calf Injury
> SPD|There is some chance of playing.
> 


--- Comments ---
> MA|
> 
> MB|90+4'
> MK|90:00 +3:29
> MC|whistle
> MD|There will be no more action in this match as the referee signals full time.
> MF|1
> MG|740
> MH|https://media-content-enetpulse.secure.footprint.net/gallery/2021/5/19/7df4d412740f558c20283d63a180c964o2.jpg
> 
> MB|90+4'
> MK|90:00 +3:20
> MC|corner
> MD|Liverpool failed to take advantage of the corner as the opposition's defence was alert and averted the threat. Liverpool are still threatening though, as it's a corner.
> 
> MB|90+3'
> MK|90:00 +2:50
> MC|
> MD|Mohamed Salah (Liverpool) skips past his man but can't keep the ball in play. Liverpool earn a corner.
> ME|1
> MF|1
> 
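
Since the question mentions thousands of match URLs, the feed addresses used above could be generated from the match id instead of being typed by hand. This is only a sketch: the names `FEED_PATTERNS`, `match_id_from_url` and `feed_urls` are made up for this example, and the patterns are copied from the single match tested here, so it is an assumption that they are the same for other matches.

```python
import re

FEED_BASE = 'https://d.flashscore.com/x/feed/'

# Feed name patterns copied from the requests above (assumed to hold for other matches).
FEED_PATTERNS = {
    'dc': 'dc_1_{match_id}',
    'df_sui': 'df_sui_1_{match_id}',
    'df_dos': 'df_dos_1_{match_id}_',   # this one has a trailing underscore
    'df_st': 'df_st_1_{match_id}',
    'df_scr': 'df_scr_1_{match_id}',
    'df_li': 'df_li_1_{match_id}',
    'df_lc': 'df_lc_1_{match_id}',
}


def match_id_from_url(url):
    """Return the id that follows '/match/' in a match page URL."""
    found = re.search(r'/match/([^/#?]+)', url)
    if found is None:
        raise ValueError('no match id in URL: ' + url)
    return found.group(1)


def feed_urls(match_url):
    """Build all feed URLs for one match page URL."""
    match_id = match_id_from_url(match_url)
    return {
        name: FEED_BASE + pattern.format(match_id=match_id)
        for name, pattern in FEED_PATTERNS.items()
    }


urls = feed_urls('https://www.flashscore.com/match/tE4RoHzB/#match-summary/match-summary')
for name, feed_url in urls.items():
    print(name, feed_url)
```

Each generated URL can then be requested with the same session, which already carries the x-fsign header.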

EDIT:

Version which formats the statistics as a pandas DataFrame:

import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd

def display(text):
    text = text.strip()
    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')
            print('>', '|'.join(parts))

def format_statisctic(text):
    text = text.strip()

    data = []
    
    row = []

    match_part = ''  # remembers whether rows belong to the full match or the 1st/2nd half

    for line in text.split('~'):
        items = line.split('¬')
        for item in items:
            parts = item.split('÷')

            # remember the current match part
            if parts[0] == 'SE':
                match_part = parts[1]

            # create row with data
            if parts[0] in ('SG', 'SH', 'SI'):
                row.append(parts[1])

            # add row to data with `match_part`
            if len(row) == 3:
                data.append([match_part] + row)
                # empty row for new data
                row = []

    # convert all to DataFrame
    df = pd.DataFrame(data, columns=['Part', 'Stat', 'SH', 'SI'])
    
    print(df)

# -------------------------
        
url = 'https://www.flashscore.com/match/tE4RoHzB/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

s = requests.Session()
s.headers.update(headers)

response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

scripts = soup.find_all('script', {'type':"text/javascript"})
for script in scripts:
    if 'window.environment =' in str(script):
        scriptStr = str(script)
        jsonMatch = re.compile("{.*}")
        jsonStr = jsonMatch.search(scriptStr)[0]
        jsonData = json.loads(jsonStr)


fsign = jsonData['config']['app']['feed_sign']

s.headers.update({'x-fsign':fsign})

print('--- Statictics ---')
url = 'https://d.flashscore.com/x/feed/df_st_1_tE4RoHzB'
response = s.get(url)
display(response.text)
text_statictics = response.text
format_statisctic(text_statictics)

Result:

        Part               Stat   SH   SI
0      Match    Ball Possession  41%  59%
1      Match      Goal Attempts   10   20
2      Match      Shots on Goal    4    3
3      Match     Shots off Goal    3    9
4      Match      Blocked Shots    3    8
5      Match         Free Kicks    8   11
6      Match       Corner Kicks    6    7
7      Match           Offsides    1    1
8      Match           Throw-in   15   15
9      Match   Goalkeeper Saves    0    4
10     Match              Fouls   10    7
11     Match       Total Passes  389  584
12     Match            Tackles   14   15
13     Match            Attacks  107  105
14     Match  Dangerous Attacks   77   53
15  1st Half    Ball Possession  37%  63%
16  1st Half      Goal Attempts    5   13
17  1st Half      Shots on Goal    1    1
18  1st Half     Shots off Goal    2    7
19  1st Half      Blocked Shots    2    5
20  1st Half         Free Kicks    2    5
21  1st Half       Corner Kicks    2    2
22  1st Half           Offsides    0    0
23  1st Half           Throw-in   10    9
24  1st Half   Goalkeeper Saves    0    1
25  1st Half              Fouls    5    2
26  1st Half       Total Passes  188  331
27  1st Half            Tackles    8   10
28  1st Half            Attacks   50   61
29  1st Half  Dangerous Attacks   35   29
30  2nd Half    Ball Possession  45%  55%
31  2nd Half      Goal Attempts    5    7
32  2nd Half      Shots on Goal    3    2
33  2nd Half     Shots off Goal    1    2
34  2nd Half      Blocked Shots    1    3
35  2nd Half         Free Kicks    6    6
36  2nd Half       Corner Kicks    4    5
37  2nd Half           Offsides    1    1
38  2nd Half           Throw-in    5    6
39  2nd Half   Goalkeeper Saves    0    3
40  2nd Half              Fouls    5    5
41  2nd Half       Total Passes  201  253
42  2nd Half            Tackles    6    5
43  2nd Half            Attacks   57   44
44  2nd Half  Dangerous Attacks   42   24
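
All values in SH and SI are still strings (for example '41%' and '10'). If numbers are needed for calculations, a small converter could be applied afterwards. This is a sketch with a made-up helper name, `to_number`; it assumes that the values are either whole numbers or percentages, as in the table above.

```python
def to_number(value):
    """Convert '41%' to 41.0 and '10' to 10; return other values unchanged."""
    text = value.strip()
    try:
        if text.endswith('%'):
            return float(text[:-1])
        return int(text)
    except ValueError:
        # not a number in either form, keep the original string
        return value


# a few values copied from the table above
for raw in ['41%', '59%', '10', '389']:
    print(raw, '->', to_number(raw))
```

With the DataFrame from the code above it could be used as `df['SH'] = df['SH'].map(to_number)`, and the same for `SI`.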

Upvotes: 3
