n1c9

Reputation: 2687

How to grab a specific <td> within a <tr> with BeautifulSoup

I'm trying to grab the names of all the high schools from Wikipedia's "List of high schools in New York City" page.

I've written enough of the script to get everything inside the <tr> tags of the table that lists each high school, its academic area, and its entrance criteria. How can I narrow that down to just the name of the school? I expected it to be in td[0], but that raises a KeyError.

Code I've written thus far:

from bs4 import BeautifulSoup
from urllib2 import urlopen

NYC = 'https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City'

html = urlopen(NYC)
soup = BeautifulSoup(html.read(), 'lxml')
schooltable = soup.find('table')
for td in schooltable:
    print(td)

Output I receive:

<tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
    <td>Academic record, interview</td>
</tr>

Output I'm seeking:

The Beacon School

Upvotes: 4

Views: 9523

Answers (2)

idjaw

Reputation: 26578

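I also managed to do this by finding the anchor inside each <td> and then reading its title attribute:

# next() returns only the first matching title; wrap the generator in list() to get all of them
titles = next(
    i.get('title') for i in [
        td.find('a') for td in soup.find_all('td') if td.find('a') is not None
    ]
)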
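
A minimal, self-contained sketch of the same idea, for anyone who wants every title rather than just the first. The SAMPLE_HTML string is a hypothetical stand-in for the Wikipedia table (the second row is invented to show a cell with no link), and 'html.parser' is used only so the example does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page has many more rows.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
  </tr>
  <tr>
    <td>No link in this cell</td>
    <td>Example area</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find("a") returns None for cells without a link, so filter those out.
anchors = [td.find("a") for td in soup.find_all("td")]
titles = [a.get("title") for a in anchors if a is not None]

# Equivalent of the next(...) call: only the first title, or None if there is none.
first_title = next(iter(titles), None)

print(titles)
print(first_title)
```

Note that this collects the title of every linked cell in the document, not only the first column, so on a page with links in other columns it may return more than the school names.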

Upvotes: 1

alecxe

Reputation: 473833

How about getting the first table on the page, iterating over every row except the header row, and taking the first td element of each row? This works for me:

for row in soup.table.find_all('tr')[1:]:
    print(row.td.text)
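
For reference, here is a hedged, self-contained sketch of that loop that runs without fetching the page. SAMPLE_HTML is a hypothetical stand-in for the Wikipedia table, and 'html.parser' is an assumption made only to avoid the lxml dependency. It also shows why td[0] fails: indexing a Tag with [] looks up an HTML attribute by name, not a child by position.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the row shown in the question.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Academic area</th><th>Entrance criteria</th></tr>
  <tr>
    <td><a href="/wiki/The_Beacon_School" title="The Beacon School">The Beacon School</a></td>
    <td>Humanities &amp; interdisciplinary</td>
    <td>Academic record, interview</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.table.find_all("tr")

# Tag[...] is attribute access, so an integer index raises KeyError.
raised_key_error = False
try:
    rows[1][0]
except KeyError:
    raised_key_error = True

names = []
for row in rows[1:]:              # skip the header row
    first_cell = row.find("td")   # same element as row.td
    if first_cell is not None:    # guard against rows that contain only <th>
        names.append(first_cell.get_text(strip=True))

print(names)
```

The None check is optional for this table, but it keeps the loop from raising AttributeError if a later row has no <td> at all.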

Upvotes: 9
