Reputation: 7997
I'm trying to scrape some sports data from a website using BeautifulSoup4, but am having trouble figuring out how to proceed. I'm not that great with HTML, and can't work out the last bit of syntax I need. Once the data is parsed, I'm going to load it into a pandas DataFrame. I'm trying to extract the home team, away team, and score. Here's my code so far:
from bs4 import BeautifulSoup
import urllib2
import csv
url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
def has_class_but_no_id(tag):
    return tag.has_attr('score')
writer = csv.writer(open("webScraper.csv", "w"))
for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)
Here's a sample of the output:
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
I need to store the home team (Newcastle), the score (0-3) and the away team (Sunderland) in three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag, and can't seem to figure out the syntax in bs4. I need something like a tag.value property, but all I have found in the documentation is tag.name or tag.attrs. Any help or pointers would be greatly appreciated!
Upvotes: 3
Views: 1100
Reputation: 3061
This is an update to the accepted answer, which is still correct. The original URL now redirects to https://www.bbc.com/sport/football/premier-league/scores-fixtures, so the tag and class names below are adjusted for that page. Ping me if you edit your answer and I will delete this one.
for match in soup.find_all('article', class_='sp-c-fixture'):
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home').find('span').find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away').find('span').find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
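One caveat with the chained .find() calls above: if the outer span is missing, .find() returns None and the next .find() raises AttributeError before the "home_tag and ..." guard is ever reached. Here is a minimal sketch of one way around that, assuming Python 3 and a bs4 install that supports CSS selectors via select_one(). The HTML fragment is hand-written for illustration (it only reuses the class names from the code above, it is not copied from the BBC page), and text_of is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; the real page markup may differ.
HTML = """
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home">
    <span><span>Newcastle</span></span>
  </span>
  <span class="sp-c-fixture__number sp-c-fixture__number--time">0-3</span>
  <span class="sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away">
    <span><span>Sunderland</span></span>
  </span>
</article>
<article class="sp-c-fixture"></article>
"""


def text_of(tag):
    """Return the joined, stripped text of a tag, or None if the tag is missing."""
    if tag is None:
        return None
    return "".join(tag.stripped_strings)


soup = BeautifulSoup(HTML, "html.parser")
fixtures = []
for match in soup.select("article.sp-c-fixture"):
    # select_one() returns None when nothing matches, so no chained call can fail.
    home = text_of(match.select_one("span.sp-c-fixture__team--time-home span span"))
    score = text_of(match.select_one("span.sp-c-fixture__number--time"))
    away = text_of(match.select_one("span.sp-c-fixture__team--time-away span span"))
    if home and score and away:
        fixtures.append((home, score, away))

print(fixtures)
```

The second, empty article in the fragment is skipped quietly instead of raising an exception.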
Upvotes: 0
Reputation: 1124548
Each score unit is located inside a <td class='match-details'> element; loop over those to extract match details.
From there, you can extract the text from child elements using the .stripped_strings generator; just pass it to ''.join() to get all strings contained in a tag. Pick team-home, score and team-away separately for ease of parsing:
for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)
With an additional print this gives:
>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
...
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
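Since the question asks for three separate fields and opens a csv.writer that is never written to, here is a rough sketch of how the loop above could collect rows and write them out. It assumes Python 3, and it runs against the sample HTML from the question wrapped in a hand-written table (the surrounding <table>/<tr> markup and the incomplete second row are assumptions for illustration). extract_matches is a hypothetical helper name, and an in-memory buffer stands in for the webScraper.csv file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample spans from the question; the table wrapper and second row are assumed.
SAMPLE = """
<table>
<tr><td class="match-details">
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
</td></tr>
<tr><td class="match-details">
<span class="team-home teams"><a href="#">West Ham</a></span>
</td></tr>
</table>
"""


def extract_matches(soup):
    """Collect (home, score, away) tuples, skipping incomplete rows."""
    rows = []
    for match in soup.find_all("td", class_="match-details"):
        fields = []
        for css_class in ("team-home", "score", "team-away"):
            tag = match.find("span", class_=css_class)
            fields.append("".join(tag.stripped_strings) if tag is not None else "")
        if all(fields):
            rows.append(tuple(fields))
    return rows


soup = BeautifulSoup(SAMPLE, "html.parser")
rows = extract_matches(soup)

# Write to an in-memory buffer; swap in open("webScraper.csv", "w", newline="")
# to produce a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["home", "score", "away"])
writer.writerows(rows)
print(buffer.getvalue())
```

The same rows list can also be handed to pandas, for example pandas.DataFrame(rows, columns=["home", "score", "away"]), if a DataFrame is the end goal.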
Upvotes: 3
Reputation: 180
You can use the tag.string property to get the value of a tag.
Refer to the documentation for more details: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
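One thing to watch for: .string is only set when a tag has a single child, so on the spans from the question (which contain whitespace around the inner tag) it comes back as None. A small sketch, assuming Python 3 and the built-in html.parser, using the score span from the question's sample output:

```python
from bs4 import BeautifulSoup

# The score span from the question's sample output.
SAMPLE = '<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>'

soup = BeautifulSoup(SAMPLE, "html.parser")
span = soup.find("span", class_="score")
abbr = span.find("abbr")

# The span has three children (whitespace, <abbr>, whitespace), so .string is None.
span_string = span.string

# The <abbr> has exactly one text child, so .string returns it, padding included.
abbr_string = abbr.string

# get_text(strip=True) works on the outer span regardless of nesting.
span_text = span.get_text(strip=True)

print(repr(span_string), repr(abbr_string), repr(span_text))
```

So .string works if you first drill down to the innermost tag (here the abbr or the a element), while get_text(strip=True) or .stripped_strings can be called on the outer span directly.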
Upvotes: 1