Reputation: 35
I am making a betting software that scrapes the fixtures from a website, puts them into an Excel file and then compares it with another Excel file that I have already manually created to find the winner (the source) The source file would look like this and The scraped output would look like this
What I want to do is for the software to find matching fixtures between the two files, then figure out the result using the "Result" column in the source file.
Problem is I don't know how to do it, and I have no idea how to look for it, I'm such a beginner and I picked this project for school, can someone at least give me a name to what I'm trying to do (if it has a name), an idea, a path to follow, etc...
Thanks a lot to everyone !
Here is my code for now (Only scraping and making an Excel file)
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
#Formation de l'url à scraper
url= "https://www.bbc.com/sport/football/scores-fixtures/"
date = input(str("Input the date, hit Enter for today"))
src = url + date
#Scraping de la page web, balise abbr avec attribut title
html = requests.get(src).text
soup = bs(html, 'lxml')
select = soup.find_all('abbr')
fixtures = []
for abbr in select:
if abbr.has_attr('title'):
output = str(abbr['title'])
fixtures.append(output)
#Création tableau excel, liste[position :: itération]
table = pd.DataFrame()
table['Home'] = pd.Series(fixtures[::2])
table['Away'] = pd.Series(fixtures[1::2])
name = 'fixtures' + '-' + date + '.xlsx'
table.to_excel(name, index=False)
=====EDIT====== I think it's gonna be easier if I explain the purpose of my Excel manual Source. It is a file that contains some very renowned fixtures and their outcome (the result), it's what we call in French "bête noire", basically a team that never wins against another. And now what I'm trying to do is for the program to look up the games of the day (or any certain date), match it against that Excel file to see if there's any "bête noire", if so, it should return the fixture with the winner (result/outcome).
EXAMPLE :
The scraped data :
Home | Away |
---|---|
Liverpool | Manchester U |
Burnley | West Ham |
Arsenal | Chelsea |
Tottenham | Brentford |
The Manual Excel source file that I have :
Home | Away | Result |
---|---|---|
Liverpool | Manchester U | Liverpool |
TestTeam | TestTeam | TestTeam |
TestTeam | TestTeam | TestTeam |
TestTeam | TestTeam | TestTeam |
Arsenal | Chelsea | Chelsea |
TestTeam | TestTeam | TestTeam |
TestTeam | TestTeam | TestTeam |
TestTeam | TestTeam | TestTeam |
Now the program should be able to match the data that it scraped against my Excel file, find the matching fixtures (Liverpool//Manchester U ; Arsenal//Chelsea), then return the result of the fixture.
I hope this has cleared any confusion, and thanks to everyone for their help I really appreciate it !
Upvotes: 1
Views: 984
Reputation: 28565
The data you are needing/asking for here is all within the api. You don't need to use Selnium then. Just get all the events data and you can manipulate the dataframe as you need.
This will get all the data for you:
import requests
import pandas as pd
import datetime
today = datetime.datetime.now().strftime("%Y-%m-%d")
date_str = input("Input the date in the form of yyyy-mm-dd, hit Enter for today\n-> ")
if not date_str:
date = today
else:
date = datetime.datetime.strptime(date_str, "%Y-%m-%d").strftime("%Y-%m-%d")
url = 'https://push.api.bbci.co.uk/batch'
query = {
't':f'/data/bbc-morph-football-scores-match-list-data/endDate/{date}/startDate/{date}/todayDate/{today}/tournament/full-priority-order/version/2.4.6?timeout=5'}
jsonData = requests.get(url, params=query).json()
rows = []
matchData = jsonData['payload'][0]['body']['matchData']
for match in matchData:
tournamentDatesWithEvents = match['tournamentDatesWithEvents']
for k, v in tournamentDatesWithEvents.items():
for i in v:
events = i['events']
for event in events:
rows.append(event)
event_df = pd.json_normalize(rows)
Output: sample of just 1st 5 rows
Input the date in the form of yyyy-mm-dd, hit Enter for today
-> 2021-11-10
print(event_df.head(5).to_string())
eventKey startTime isTBC minutesElapsed minutesIntoAddedTime eventStatus eventStatusNote eventStatusReason eventOutcomeType eventType seriesWinner cpsId cpsLive officials tournamentInfo eventActions startTimeInUKHHMM comment href tournamentSlug homeTeam.key homeTeam.scores.score homeTeam.scores.halfTime homeTeam.scores.fullTime homeTeam.scores.extraTime homeTeam.scores.shootout homeTeam.scores.aggregate homeTeam.scores.aggregateGoalsAway homeTeam.formation homeTeam.eventOutcome homeTeam.name.first homeTeam.name.full homeTeam.name.abbreviation homeTeam.name.last awayTeam.key awayTeam.scores.score awayTeam.scores.halfTime awayTeam.scores.fullTime awayTeam.scores.extraTime awayTeam.scores.shootout awayTeam.scores.aggregate awayTeam.scores.aggregateGoalsAway awayTeam.formation awayTeam.eventOutcome awayTeam.name.first awayTeam.name.full awayTeam.name.abbreviation awayTeam.name.last eventProgress.period eventProgress.status venue.name.abbreviation venue.name.videCode venue.name.first venue.name.full venue.homeCountry tournamentName.first tournamentName.full tournamentName.abbreviation
0 EFBO2257629 2021-11-10T17:00:00+00:00 False NaN NaN post-event Result None regular Regular None None None [] None None 17:00 None None international-friendlies TFBB11658 0 0 0 None None None None None loss Kosovo Kosovo Kosovo None TFBB1843 2 0 0 None None None None None win Jordan Jordan Jordan None FULLTIME RESULT Stadiumi Fadil Vokrri Stadiumi Fadil Vokrri Stadiumi Fadil Vokrri Stadiumi Fadil Vokrri None International Friendlies International Friendlies Int Friendlies
1 EFBO2245946 2021-11-10T17:45:00+00:00 False 90.0 3.0 post-event Result None regular Regular None None None [] None None 17:45 None None womens-champions-league TFBB9121 4 3 4 None None None None None win Barcelona Femenino Barcelona Femenino Barcelona Femenino None TFBB8706 0 0 0 None None None None None loss TSG 1899 Hoffenheim Ladies TSG 1899 Hoffenheim Ladies TSG 1899 Hoffenheim Ladies None FULLTIME RESULT Estadi Johan Cruyff Estadi Johan Cruyff Estadi Johan Cruyff Estadi Johan Cruyff Spain Women's Champions League UEFA Women's Champions League Women's Champs Lge
2 EFBO2245947 2021-11-10T17:45:00+00:00 False 90.0 4.0 post-event Result None regular Regular None 59214028 yes [] None None 17:45 None /sport/football/59214028 womens-champions-league TFBB16926 1 0 1 None None None None None loss HB Køge Women HB Køge Women HB Køge Women None TFBB6272 5 1 5 None None None None None win Arsenal Women Arsenal Women Arsenal Women None FULLTIME RESULT Capelli Sport Stadion Capelli Sport Stadion Capelli Sport Stadion Capelli Sport Stadion Denmark Women's Champions League UEFA Women's Champions League Women's Champs Lge
3 EFBO2245948 2021-11-10T20:00:00+00:00 False 90.0 6.0 post-event Result None regular Regular None None None [] None None 20:00 None None womens-champions-league TFBB16019 0 0 0 None None None None None loss Benfica Women Benfica Women Benfica Women None TFBB17702 1 0 1 None None None None None win BK Häcken Women BK Häcken Women BK Häcken Women None FULLTIME RESULT Caixa Futebol Campus Caixa Futebol Campus Caixa Futebol Campus Caixa Futebol Campus Portugal Women's Champions League UEFA Women's Champions League Women's Champs Lge
4 EFBO2245949 2021-11-10T20:00:00+00:00 False 90.0 4.0 post-event Result None regular Regular None None None [] None None 20:00 None None womens-champions-league TFBB9100 2 0 2 None None None None None win Lyon Féminines Lyon Féminines Lyon Féminines None TFBB6597 1 1 1 None None None None None loss FC Bayern München Ladies FC Bayern München Ladies FC Bayern München Ladies None FULLTIME RESULT Groupama Stadium Groupama Stadium Groupama Stadium Groupama Stadium France Women's Champions League UEFA Women's Champions League Women's Champs Lge
Now can just pull out the columns you want:
results = event_df[['homeTeam.name.full','homeTeam.eventOutcome','awayTeam.name.full','awayTeam.eventOutcome']]
Output:
print(results.to_string())
homeTeam.name.full homeTeam.eventOutcome awayTeam.name.full awayTeam.eventOutcome
0 Kosovo loss Jordan win
1 Barcelona Femenino win TSG 1899 Hoffenheim Ladies loss
2 HB Køge Women loss Arsenal Women win
3 Benfica Women loss BK Häcken Women win
4 Lyon Féminines win FC Bayern München Ladies loss
5 Walsall loss Forest Green Rovers win
6 Exeter City win Bristol Rovers loss
7 Wigan Athletic win Shrewsbury Town loss
8 Broomhill loss Bo'ness United win
9 Cumbernauld Colts loss Rangers B win
10 Dalbeattie Star loss East Kilbride win
11 East Stirlingshire tie University of Stirling tie
12 Grêmio win Fluminense loss
13 Athletico Paranaense win Ceará loss
14 Atlético Mineiro win Corinthians loss
15 Santos win Red Bull Bragantino loss
16 Palmeiras win Atlético Goianiense loss
EDIT: ADDITIONAL CODE
import pandas as pd
data = [['Liverpool','Manchester U'],
['Burnley' ,'West Ham'],
['Arsenal' ,'Chelsea'],
['Tottenham','Brentford']]
scraped_data = pd.DataFrame(data, columns = ['Home', 'Away'])
data = [['Liverpool', 'Manchester U', 'Liverpool'],
['TestTeamA', 'TestTeamB', 'TestTeamA'],
['Arsenal', 'Chelsea', 'Chelsea']]
excel_data = pd.DataFrame(data, columns = ['Home', 'Away', 'Result'])
result = scraped_data.merge(excel_data, how='left', on=['Home','Away']).dropna(subset=['Result'])
Output:
print(result)
Home Away Result
0 Liverpool Manchester U Liverpool
2 Arsenal Chelsea Chelsea
Upvotes: 1
Reputation: 1257
The table of this url is not so easy to manipulate, I suggest using selenium:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
options = webdriver.ChromeOptions()
options.add_argument('--no-default-browser-check')
options.add_argument('--log-level=3')
options.add_argument('--headless')
service = Service('chromedriver.exe')
driver = webdriver.Chrome(options=options, service=service)
date = input(str("Input the date, hit Enter for today: "))
driver.get("https://www.bbc.com/sport/football/scores-fixtures/" + date)
games = driver.find_elements(By.XPATH, "//article[@data-event-id]")
ListHome = []
ListAway = []
ListTime = []
for game in games:
ListHome.append(game.find_element(By.XPATH,".//span[contains(@class,'team--time-home')]").text)
ListAway.append(game.find_element(By.XPATH,".//span[contains(@class,'team--time-away')]").text)
df = pd.DataFrame(list(zip(ListHome, ListAway)), columns =['Home', 'Away'])
df.to_excel('name.xlsx', index=False)
Use pd.merge to compare with your other spreadsheet.
Upvotes: 0