flyingmeatball

Reputation: 7997

Python Beautifulsoup4 website parsing

I'm trying to scrape some sports data from a website using BeautifulSoup4, but I'm having trouble figuring out how to proceed. I'm not that great with HTML, and I can't work out the last bit of syntax I need. Once the data is parsed, I'm going to load it into a pandas DataFrame. I'm trying to extract the home team, away team, and score. Here's my code so far:

from bs4 import BeautifulSoup
import urllib2
import csv

url = 'http://www.bbc.com/sport/football/premier-league/results'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)

def has_class_but_no_id(tag):
    return tag.has_attr('score')

writer = csv.writer(open("webScraper.csv", "w"))

for tag in soup.find_all('span', {'class':['team-away', 'team-home', 'score']}):
    print(tag)

Here's a sample of the output:

<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>

I need to store the home team (Newcastle), the score (0-3) and the away team (Sunderland) in three separate fields. Essentially, I'm stuck trying to extract the "value" from each tag, and can't seem to figure out the syntax in bs4. I need like a tag.value property, but all I have found in the documentation is a tag.name or tag.attrs. Any help or pointers would be greatly appreciated!

Upvotes: 3

Views: 1100

Answers (3)

Anonymous Type

Reputation: 3061

This is an update to the accepted answer, whose approach is still correct. The results URL now redirects to https://www.bbc.com/sport/football/premier-league/scores-fixtures, so the code below uses the class names found on that page.

Ping me if you edit your answer and I will delete this one.

for match in soup.find_all('article', class_='sp-c-fixture'):
    # The team name is two nested <span> levels down; guard each step so a missing span does not raise AttributeError
    home_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-home')
    home_tag = home_tag and home_tag.find('span')
    home_tag = home_tag and home_tag.find('span')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='sp-c-fixture__number sp-c-fixture__number--time')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time sp-c-fixture__team--time-away')
    away_tag = away_tag and away_tag.find('span')
    away_tag = away_tag and away_tag.find('span')
    away = away_tag and ''.join(away_tag.stripped_strings)
    if home and score and away:
        print(home, score, away)
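
One caveat about the long class strings: when class_ is given several space-separated classes, BeautifulSoup compares it against the exact value of the class attribute, so a change in class order on the page makes the lookup return None. A CSS selector matches on individual classes instead. The sketch below illustrates the difference on made-up markup; the HTML structure is an assumption for demonstration, not a copy of the live BBC page, and Python 3 is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names mirror the ones used above, but the
# structure (and the reversed class order) is an assumption for illustration.
html = '''
<article class="sp-c-fixture">
  <span class="sp-c-fixture__team--time-home sp-c-fixture__team">
    <span><span>Newcastle</span></span>
  </span>
</article>
'''
soup = BeautifulSoup(html, 'html.parser')

# A multi-class string passed to class_ must equal the attribute value exactly,
# so the same classes in a different order are not found.
exact = soup.find('span', class_='sp-c-fixture__team sp-c-fixture__team--time-home')

# A CSS selector matches on a single class regardless of order and walks
# down the two nested spans in one step.
by_css = soup.select_one('span.sp-c-fixture__team--time-home span span')

print(exact)
print(by_css.get_text(strip=True))
```

select_one() returns None when nothing matches, so the same `tag and ...` guard used above still applies.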

Upvotes: 0

Martijn Pieters

Reputation: 1124548

Each score unit is located inside a <td class='match-details'> element; loop over those to extract the match details.

From there, you can extract the text from child elements using the .stripped_strings generator; just pass it to ''.join() to get all strings contained in a tag. Pick team-home, score and team-away separately for ease of parsing:

for match in soup.find_all('td', class_='match-details'):
    home_tag = match.find('span', class_='team-home')
    home = home_tag and ''.join(home_tag.stripped_strings)
    score_tag = match.find('span', class_='score')
    score = score_tag and ''.join(score_tag.stripped_strings)
    away_tag = match.find('span', class_='team-away')
    away = away_tag and ''.join(away_tag.stripped_strings)

With an additional print this gives:

>>> for match in soup.find_all('td', class_='match-details'):
...     home_tag = match.find('span', class_='team-home')
...     home = home_tag and ''.join(home_tag.stripped_strings)
...     score_tag = match.find('span', class_='score')
...     score = score_tag and ''.join(score_tag.stripped_strings)
...     away_tag = match.find('span', class_='team-away')
...     away = away_tag and ''.join(away_tag.stripped_strings)
...     if home and score and away:
...         print home, score, away
... 
Newcastle 0-3 Sunderland
West Ham 2-0 Swansea
Cardiff 2-1 Norwich
Everton 2-1 Aston Villa
Fulham 0-3 Southampton
Hull 1-1 Tottenham
Stoke 2-1 Man Utd
Aston Villa 4-3 West Brom
Chelsea 0-0 West Ham
Sunderland 1-0 Stoke
Tottenham 1-5 Man City
Man Utd 2-0 Cardiff
# etc. etc. etc.
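
Since the question mentions loading the results into a pandas DataFrame, here is a minimal sketch of collecting the parsed values as rows rather than printing them. It runs on the sample markup from the question instead of the live page, assumes Python 3, and the helper name text_of and the dictionary keys are illustrative choices, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, wrapped in the <td> described above.
html = '''
<table><tr><td class="match-details">
<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>
<span class="score"> <abbr title="Score"> 0-3 </abbr> </span>
<span class="team-away teams">
<a href="/sport/football/teams/sunderland">Sunderland</a> </span>
</td></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')


def text_of(parent, css_class):
    """Return the joined text of the first matching span, or None if absent."""
    tag = parent.find('span', class_=css_class)
    return tag and ''.join(tag.stripped_strings)


rows = []
for match in soup.find_all('td', class_='match-details'):
    home = text_of(match, 'team-home')
    score = text_of(match, 'score')
    away = text_of(match, 'team-away')
    # Skip cells that lack any of the three fields.
    if home and score and away:
        rows.append({'home': home, 'score': score, 'away': away})

print(rows)
```

The resulting list of dictionaries can be passed to pandas.DataFrame(rows), which uses the keys as column names.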

Upvotes: 3

Anish Sheela

Reputation: 180

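You can use the tag.string property to get the text of a tag. Note that it returns None when the tag has more than one child, as the spans in the question do; in that case use tag.get_text() instead.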

Refer to the documentation for more details. http://www.crummy.com/software/BeautifulSoup/bs4/doc/
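
As a rough illustration on the home-team span from the question (Python 3 assumed), the snippet below compares .string with get_text():

```python
from bs4 import BeautifulSoup

# Home-team span copied from the sample output in the question.
html = '''<span class="team-home teams">
<a href="/sport/football/teams/newcastle-united">Newcastle</a> </span>'''
span = BeautifulSoup(html, 'html.parser').span

# The span has three children (whitespace, <a>, whitespace), so .string is None.
print(span.string)

# The <a> tag has a single text child, so .string returns that text.
print(span.a.string)

# get_text() gathers all nested text; strip=True trims surrounding whitespace.
print(span.get_text(strip=True))
```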

Upvotes: 1
