Reputation: 325
I'm trying to scrape a few schedule tables from ESPN: http://www.espn.com/nba/schedule/_/date/20171001
import requests
import bs4
response = requests.get('http://www.espn.com/nba/schedule/_/date/20171001')
soup = bs4.BeautifulSoup(response.text, 'lxml')
print(soup.prettify())
table = soup.find_all('table')
data = []
for i in table:
    rows = i.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [col.text.strip() for col in cols]
        data.append([col for col in cols if col])
My code works fine except the output is missing the date info:
[
"Phoenix PHX",
"Utah UTAH",
"394 tickets available from $6"
],
[],
[
"Miami MIA",
"Orlando ORL",
"1,582 tickets available from $12"
]
After some investigation, I realized that the date and time information is wrapped within the tags like so:
<td data-behavior="date_time" data-date="2017-10-07T23:00Z"><a data-dateformat="time1" href="/nba/game?gameId=400978807" name="&lpos=nba:schedule:time"></a></td>
I see this on other websites from time to time as well but never really understood why they do it this way. How can I extract text inside an open tag to get the "2017-10-07T23:00Z" in my output?
Upvotes: 0
Views: 115
Reputation: 141998
Some td tags in that table carry their data in attributes rather than in text. You can read a td's attributes through its attrs property, which is a dict:
>>> td = soup.select('tr')[1].select('td')[2]
>>> td
<td data-behavior="date_time" data-date="2017-10-01T22:00Z"><a data-dateformat="time1" href="/nba/game?gameId=400978817" name="&lpos=nba:schedule:time"></a></td>
>>> td.attrs
{'data-date': '2017-10-01T22:00Z', 'data-behavior': 'date_time'}
>>> td.attrs['data-date']
'2017-10-01T22:00Z'
To that end, you can write a function that returns the date if that attribute is present, or otherwise just returns the text of the td:
import requests
import bs4
def date_or_text(td):
    if 'data-date' in td.attrs:
        return td.attrs['data-date']
    return td.text
def extract_game_information(tr):
    tds_with_blanks = (date_or_text(td) for td in tr.select('td'))
    return [data for data in tds_with_blanks if data]
response = requests.get('http://www.espn.com/nba/schedule/_/date/20171001')
soup = bs4.BeautifulSoup(response.text, 'lxml')
data = [extract_game_information(tr) for tr in soup.select('tr')]
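To try this without hitting ESPN (the live page may have changed since this was written), here is a minimal offline sketch. The markup in SAMPLE_HTML is an assumption: the first row is adapted from the td shown in the question, and the second row is made up to show the fallback to text. It uses the built-in 'html.parser' so lxml is not required, and it strips whitespace from the text, which the version above does not.

```python
import bs4

# Assumed sample markup: row 1 is adapted from the question's <td>,
# row 2 is a made-up row whose last cell has no data-date attribute.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Phoenix PHX</td>
    <td>Utah UTAH</td>
    <td data-behavior="date_time" data-date="2017-10-07T23:00Z"><a data-dateformat="time1" href="/nba/game?gameId=400978807"></a></td>
  </tr>
  <tr>
    <td>Miami MIA</td>
    <td>Orlando ORL</td>
    <td></td>
  </tr>
</table>
"""

def date_or_text(td):
    # Prefer the timestamp stored in the attribute; fall back to visible text.
    if 'data-date' in td.attrs:
        return td.attrs['data-date']
    return td.text.strip()

def extract_game_information(tr):
    values = (date_or_text(td) for td in tr.select('td'))
    # Drop cells that have neither a date attribute nor any text.
    return [value for value in values if value]

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = [extract_game_information(tr) for tr in soup.select('tr')]
print(rows)
```

The first row should now end with the data-date value, while the second row only contains the two team names.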
Upvotes: 1
Reputation: 4069
The attrs property contains a dictionary of the tag's attributes, which you can use to fetch the value. I have added a length check because some empty rows are present.
Try modifying the for loop as below:
for i in table:
    rows = i.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        date_data = None
        if len(cols) > 2:
            # .get() returns None instead of raising KeyError when the cell has no data-date
            date_data = cols[2].attrs.get('data-date')
        cols = [col.text.strip() for col in cols]
        dat = [col for col in cols if col]
        if date_data:
            dat.append(date_data)
        data.append(dat)
PS: the above snippet can be optimized :-)
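One possible tidier variant of that loop, as a sketch only. It assumes the date always sits in the third cell, as in the loop above, and it reads the attribute with the tag's get() method so that a third cell without data-date yields None rather than a KeyError. The function name collect_rows and the sample markup are hypothetical; the score in the second row is invented purely to show a cell with text but no date.

```python
import bs4

# Assumed sample markup: a scheduled game, a made-up finished game whose
# third cell holds plain text, and an empty row.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Phoenix PHX</td>
    <td>Utah UTAH</td>
    <td data-behavior="date_time" data-date="2017-10-07T23:00Z"><a href="/nba/game?gameId=400978807"></a></td>
  </tr>
  <tr>
    <td>Miami MIA</td>
    <td>Orlando ORL</td>
    <td>MIA 109, ORL 93</td>
  </tr>
  <tr></tr>
</table>
"""

def collect_rows(soup):
    data = []
    for row in soup.find_all('tr'):
        cols = row.find_all('td')
        # Tag.get() returns None when the attribute is missing.
        date_data = cols[2].get('data-date') if len(cols) > 2 else None
        dat = [text for text in (col.text.strip() for col in cols) if text]
        if date_data:
            dat.append(date_data)
        data.append(dat)
    return data

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
data = collect_rows(soup)
print(data)
```

Empty rows still produce an empty list here, matching the original loop; filter them out afterwards if they are not wanted.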
Upvotes: 1