Reputation:
I am trying to extract data from the html of the following site:
http://www.irishrugby.ie/guinnesspro12/results_and_fixtures_pro_12_section.php
I want to extract the team names and the score for each fixture. For example, the first fixture is Connacht vs Newport Gwent Dragons.
I want my Python program to print the result, i.e. Connacht Rugby 29 - 23 Newport Gwent Dragons.
Here is the HTML I want to extract it from:
<!-- 207974 sfms -->
<tr class="odd match-result group_celtic_league" id="fixturerow0" onclick="if( clickpriority == 0 ) { redirect('/guinnesspro12/35435.php') }" onmouseout="className='odd match-result group_celtic_league';" onmouseover="clickpriority=0; className='odd match-result group_celtic_league rollover';" style="">
<td class="field_DateShort" style="">
Fri 4 Sep
</td>
<td class="field_TimeLong" style="">
19:30
</td>
<td class="field_CompStageAbbrev" style="">
PRO12
</td>
<td class="field_LogoTeamA" style="">
<img alt="Connacht Rugby" height="50" src="http://cdn.soticservers.net/tools/images/teams/logos/50x50/16.png" width="50"/>
</td>
<td class="field_HomeDisplay" style="">
Connacht Rugby
</td>
<td class="field_Score" style="">
29 - 23
</td>
<td class="field_AwayDisplay" style="">
Newport Gwent Dragons
</td>
<td class="field_LogoTeamB" style="">
<img alt="Newport Gwent Dragons" height="50" src="http://cdn.soticservers.net/tools/images/teams/logos/50x50/19.png" width="50"/>
</td>
<td class="field_HA" style="">
H
</td>
<td class="field_OppositionDisplay" style="">
<br/>
</td>
<td class="field_ResScore" style="">
W 29-23
</td>
<td class="field_VenName" style="">
Sportsground
</td>
<td class="field_BroadcastAttend" style="">
3,624
</td>
<td class="field_Links" style="">
<a href="/guinnesspro12/35435.php" onclick="clickpriority=1">
Report
</a>
</td>
</tr>
This is my program so far:
from httplib2 import Http
from bs4 import BeautifulSoup
# create a "web object"
h = Http()
# Request the specified web page
response, content = h.request('http://www.irishrugby.ie/guinnesspro12/results_and_fixtures_pro_12_section.php')
# display the response status
print(response.status)
# display the text of the web page
print(content.decode())
soup = BeautifulSoup(content)
# check the response
if response.status == 200:
    #print(soup.get_text())
    rows = soup.find_all('tr')[1:-2]
    for row in rows:
        data = row.find_all('td')
        #print(data)
else:
    print('Unable to connect:', response.status)
    print(soup.get_text())
Upvotes: 1
Views: 104
Reputation: 46779
Here is another possible solution:
import requests
from bs4 import BeautifulSoup
req_cols = ['field_HomeDisplay', 'field_Score', 'field_AwayDisplay']
html = requests.get('http://www.irishrugby.ie/guinnesspro12/results_and_fixtures_pro_12_section.php')
if html.status_code == 200:
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html.text, 'html.parser')
    fixtures = soup.find('table', class_='fixtures')
    for row in fixtures.find_all('tr'):
        try:
            print(' '.join(row.find('td', class_=col).get_text() for col in req_cols))
        except AttributeError:
            # row.find() returned None: this row lacks one of the required cells
            pass
This would give you the following type of output:
Connacht Rugby 29 - 23 Newport Gwent Dragons
Connacht Rugby 29 - 23 Newport Gwent Dragons
Edinburgh Rugby 16 - 9 Leinster Rugby
Ulster Rugby 28 - 6 Ospreys
Munster Rugby 18 - 13 Benetton Treviso
Glasgow Warriors 33 - 32 Connacht Rugby
.
.
.
Connacht Rugby v Glasgow Warriors
Leinster Rugby v Benetton Treviso
Munster Rugby v Scarlets
Ospreys v Ulster Rugby
v
v
v
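The bare v lines at the end of that output appear to come from rows whose team cells are empty. One way to drop them is to run the three cell texts through a small filter before printing. This is only a sketch: format_fixture is a hypothetical helper name, and the sample tuples stand in for the strings that get_text() would return.

```python
def format_fixture(home, score, away):
    """Return 'home score away', or None when either team name is blank."""
    # Collapse any newlines or repeated spaces inside each cell text.
    home, score, away = (" ".join(part.split()) for part in (home, score, away))
    if not home or not away:
        return None  # placeholder row with no teams
    return f"{home} {score} {away}"


# Sample cell texts (assumed values, not fetched from the site).
rows = [
    ("\nConnacht Rugby\n", "\n29 - 23\n", "\nNewport Gwent Dragons\n"),
    ("Ospreys", "v", "Ulster Rugby"),
    ("", "v", ""),
]
lines = [line for line in (format_fixture(*r) for r in rows) if line is not None]
for line in lines:
    print(line)
```

In the loop above you would call it with the get_text() result of each of the three cells and print only the non-None results.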
Upvotes: 0
Reputation: 40773
There is an easier way: download the .ics calendar file and parse that instead. The calendar file is plain text and its format is easier to parse than the HTML. If you look at the top of your original page, you will see a link to download the .ics file.
The code will be really simple since each record in the .ics file looks like this:
BEGIN:VEVENT
DTSTAMP:20151103T141214Z
DTSTART:20150904T183000Z
DTEND:20150904T183000Z
SUMMARY:Connacht Rugby 29 - 23 Newport Gwent Dragons
DESCRIPTION:Competition - Guinness PRO12
LOCATION:Sportsground
URL:/rugby/match_centre.php?section=overview&fixid=207974
UID:20150904T183000ZConnacht Rugby 29 - 23 Newport Gwent Dragons
END:VEVENT
All you need to do is look for lines that start with SUMMARY: and remove that tag:
from httplib2 import Http
def extract_summary(content):
    for line in content.splitlines():
        if line.startswith('SUMMARY:'):
            # drop the leading 'SUMMARY:' tag and keep the rest of the line
            yield line[len('SUMMARY:'):]
if __name__ == '__main__':
    h = Http()
    response, content = h.request('http://www.irishrugby.ie/tools/calendars/irfu-irfu-guinnesspro12.ics?v=144656')
    content = content.decode()
    for line in extract_summary(content):
        print(line)
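If you need the teams and the score as separate values rather than one string, a regular expression over each summary is probably enough. The sketch below is not part of the code above: split_summary and SUMMARY_RE are hypothetical names, and it assumes that played fixtures look like "Home N - N Away" and that unplayed ones use a bare "v", which I have not verified against the whole .ics file.

```python
import re

# Home team, then either "N - N" or a bare "v", then the away team.
SUMMARY_RE = re.compile(r"^(?P<home>.+?) (?P<score>\d+ - \d+|v) (?P<away>.+)$")


def split_summary(summary):
    """Return (home, score, away), or None if the line has another shape."""
    match = SUMMARY_RE.match(summary)
    if match is None:
        return None
    return match.group("home"), match.group("score"), match.group("away")


print(split_summary("Connacht Rugby 29 - 23 Newport Gwent Dragons"))
# ('Connacht Rugby', '29 - 23', 'Newport Gwent Dragons')
```

Returning None for lines that do not match lets the caller decide whether to skip or report them.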
Upvotes: 0
Reputation: 9048
Instead of finding all the <td> tags, you should be more specific. I would convert this:
for row in rows:
    data = row.find_all('td')
to this:
for row in rows:
    home = row.find("td", attrs={"class": "field_HomeDisplay"})
    score = row.find("td", attrs={"class": "field_Score"})
    away = row.find("td", attrs={"class": "field_AwayDisplay"})
    print(home.get_text() + " " + score.get_text() + " " + away.get_text())
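If you want to try the select-by-class idea without BeautifulSoup installed, it can also be sketched with the standard library's html.parser. To be clear, this is a swapped-in technique, not the BeautifulSoup code above: FixtureParser is a hypothetical class name, and the sample string is a cut-down copy of the row from the question.

```python
from html.parser import HTMLParser

WANTED = ("field_HomeDisplay", "field_Score", "field_AwayDisplay")


class FixtureParser(HTMLParser):
    """Collect the wanted <td> cells by class, one output string per <tr>."""

    def __init__(self):
        super().__init__()
        self.fixtures = []   # one "home score away" string per complete row
        self._row = None     # cell texts of the row being read
        self._cell = None    # class of the wanted cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            classes = (dict(attrs).get("class") or "").split()
            self._cell = next((c for c in classes if c in WANTED), None)

    def handle_data(self, data):
        if self._cell is not None:
            self._row[self._cell] = self._row.get(self._cell, "") + data

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep the row only if all three wanted cells were present.
            if all(name in self._row for name in WANTED):
                self.fixtures.append(
                    " ".join(self._row[name].strip() for name in WANTED))
            self._row = None
            self._cell = None


sample = """
<tr class="odd match-result group_celtic_league" id="fixturerow0">
<td class="field_DateShort">Fri 4 Sep</td>
<td class="field_HomeDisplay">
Connacht Rugby
</td>
<td class="field_Score">
29 - 23
</td>
<td class="field_AwayDisplay">
Newport Gwent Dragons
</td>
</tr>
"""

parser = FixtureParser()
parser.feed(sample)
parser.close()
print(parser.fixtures)
```

Rows that lack any of the three cells, such as header rows, are simply left out of parser.fixtures.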
Upvotes: 1