Reputation: 2687
I'm trying to grab the names of all the high schools from Wikipedia's "List of high schools in New York City" page.
I've written enough of the script to get all of the info contained within the <tr> tags of the table that lists each high school, its academic area and its entrance criteria. How can I narrow that down to just the name of the school? I thought it would be in td[0], but that raises a KeyError.
Code I've written thus far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
NYC = 'https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City'
html = urlopen(NYC)
soup = BeautifulSoup(html.read(), 'lxml')
schooltable = soup.find('table')
for td in schooltable:
    print(td)
Output I receive:
<tr>
<td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
<td>Humanities & interdisciplinary</td>
<td>Academic record, interview</td>
</tr>
Output I'm seeking:
The Beacon School
Upvotes: 4
Views: 9523
Reputation: 26578
I also managed to do this by finding the first anchor inside each <td> and then reading its title attribute:
# Note: next() returns only the first matching title.
titles = next(
    i.get('title') for i in [
        td.find('a') for td in soup.findAll('td') if td.find('a') is not None
    ]
)
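If you want every title rather than only the first one, a plain loop may be easier to read. This is a minimal sketch, not a tested scraper for the live page: it runs on a small inline sample, the function name `link_titles` and the "Example School" row are made up for illustration, and it uses the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the output shown in the question;
# the second row ("Example School") is invented for illustration.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
    <td>Academic record, interview</td>
  </tr>
  <tr>
    <td><a href="/wiki/Example_School" title="Example School">Example School</a></td>
    <td>Science</td>
    <td>Exam</td>
  </tr>
</table>
"""


def link_titles(html):
    """Return the title of the first link in every <td> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for td in soup.find_all("td"):
        anchor = td.find("a")
        # Skip cells without a link, and links without a title attribute.
        if anchor is not None and anchor.get("title"):
            titles.append(anchor["title"])
    return titles


if __name__ == "__main__":
    for title in link_titles(SAMPLE_HTML):
        print(title)
```

Keep in mind that this looks at every <td> in the document, so on a real page it could also pick up links from other columns or other tables.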
Upvotes: 1
Reputation: 473833
How about you get the first table on the page, iterate over all of its rows except the first (header) one, and get the first td element of every row? Works for me:
for row in soup.table.find_all('tr')[1:]:
    print(row.td.text)
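For reference, here is a self-contained sketch of the same idea that runs without a network request. It assumes a small inline sample in place of the real page (the header row and the "Example School" row are invented), uses the built-in `html.parser`, and the helper names are hypothetical. It also shows where the KeyError likely comes from: indexing a tag with square brackets looks up an HTML attribute, not a child element.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; only the Beacon School row comes from the question.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Academic area</th><th>Entrance criteria</th></tr>
  <tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
    <td>Academic record, interview</td>
  </tr>
  <tr>
    <td><a href="/wiki/Example_School" title="Example School">Example School</a></td>
    <td>Science</td>
    <td>Exam</td>
  </tr>
</table>
"""


def first_column_text(html):
    """Return the text of the first <td> in every row after the header."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.table.find_all("tr")[1:]  # skip the header row
    # row.td is the first <td> descendant of the row.
    return [row.td.get_text(strip=True) for row in rows]


def indexing_raises_key_error(html):
    """Check that tag[0] is treated as an attribute lookup and fails."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td")
    try:
        cell[0]  # looks for an attribute named 0, not the first child
    except KeyError:
        return True
    return False


if __name__ == "__main__":
    for name in first_column_text(SAMPLE_HTML):
        print(name)
```

The header row is skipped because its cells are <th> elements, so row.td would be None there.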
Upvotes: 9