Nicole C. Baratta

Reputation: 97

Grabbing data from td tags with Python and BeautifulSoup

I'm a beginner with Python and am working through some tasks with data I'm familiar with to learn the basics. I'm trying to crawl through a table to gather contact information but having issues getting at the data in a list of tds.

The HTML looks like this:

<table class="table table-striped" data-drupal-selector="edit-directory" id="edit-directory--zJwP9mT4moQ">
   <thead>
   <tr>
       <th>Name</th>
       <th>Job Title</th>
       <th>Campus/Department</th>
       <th>Contact</th>
   </tr>
   </thead>
   <tbody>
   <tr class="odd">
       <td>LAST, FIRST</td>
       <td>T-HS SCI- GEN'L</td>
       <td><span tabindex="0">SCHOOL</span></td>
       <td><a href="mailto:[email protected]" class="email"><span aria-hidden="true">Email</span><span class="sr-only">[email protected]</span></a><br>555-555-5555</td>
   </tr>
</table>

I have this code to get the table:

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup as bs

data = urllib.parse.urlencode(params).encode("utf-8")
req = urllib.request.Request(url)
with urllib.request.urlopen(req, data=data) as f:
    soup = bs(f, 'html.parser')

table = soup.find("table")

for row in table.findAll("tr"):
    #print (row)
    cells = row.findAll("td")
    print(cells)

I get something like this:

[<td>LAST,FIRST </td>, <td>TEMP PROF</td>, <td><span tabindex="0">SCHOOL</span></td>, <td><a class="email" href="mailto:[email protected]"><span aria-hidden="true">Email</span><span class="sr-only">[email protected]</span></a><br/>555-555-5555</td>]

[<td><a href="https://teachersite.com" target="_blank">LAST, FIRST</a></td>, <td>T-ENGLISH</td>, <td><span tabindex="0">SCHOOL</span></td>, <td><a class="email" href="mailto:[email protected]"><span aria-hidden="true">Email</span><span class="sr-only">[email protected]</span></a><br/>555-555-5555</td>]

But if I try to then get at the data in the list:

print (cells[1]) 

I get an error saying the index is out of range.

What I'm trying to get is something like this:

last = 'LAST'
first = 'FIRST'
email = '[email protected]'
title = 'TEMP PROF'
phone = '555-555-5555'

Upvotes: 3

Views: 717

Answers (2)

Ajax1234

Reputation: 71451

You can iterate over the tds for each tr and grab the data you need:

from bs4 import BeautifulSoup as soup
def scrape_td(d):
    n, t, _, c = d.find_all('td')  # name, title, campus (unused), contact
    return {
        **dict(zip(['last', 'first'], n.text.split(', '))),
        'title': t.text,
        'email': c.contents[0]['href'][7:],  # drop the "mailto:" prefix
        'phone': c.contents[-1],
    }

rows = soup(html, 'html.parser').find('table', {'id': 'edit-directory--zJwP9mT4moQ'}).find_all('tr')[1:]  # skip the header row
results = list(map(scrape_td, rows))

Output:

[{'last': 'LAST', 'first': 'FIRST', 'title': "T-HS SCI- GEN'L", 'email': '[email protected]', 'phone': '555-555-5555'}]
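
The `[1:]` slice matters because the header row contains only th cells, so find_all('td') returns an empty list for it. That empty list is most likely what made cells[1] raise the index error in the question. As a rough sketch (the HTML below is a trimmed, hypothetical copy of the question's table, with the contact cell reduced to a phone number), you could skip rows without td cells instead of slicing:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical copy of the table from the question.
html = """
<table>
  <thead>
    <tr><th>Name</th><th>Job Title</th><th>Campus/Department</th><th>Contact</th></tr>
  </thead>
  <tbody>
    <tr class="odd">
      <td>LAST, FIRST</td>
      <td>T-HS SCI- GEN'L</td>
      <td><span tabindex="0">SCHOOL</span></td>
      <td>555-555-5555</td>
    </tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# The header row has <th> cells only, so it yields no <td> elements.
header_cells = table.find_all("tr")[0].find_all("td")

titles = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # empty list for the header row: skip it
        continue
    titles.append(cells[1].get_text(strip=True))

print(titles)
```

Checking for an empty list rather than slicing off the first row also copes with tables that have no header row, or more than one.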

Upvotes: 0

jkulskis

Reputation: 124

It seems like you want to strip the text from each element:

for row in table.findAll('tr'):
    cols = row.findAll('td')
    cols = [element.text.strip() for element in cols]
    for col in cols:
        print(col)

To get the first and last name, split the first element on the comma and space with .split(', '). Hopefully this points you in the right direction!
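
Putting those pieces together, here is a minimal sketch of pulling out the fields the question asks for. It assumes every data row has exactly four td cells laid out like the sample, and that the phone number is the last piece of text in the contact cell. The parse_row helper and the example.com address are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the question's markup; the address is a placeholder.
html = """
<table>
  <tr><th>Name</th><th>Job Title</th><th>Campus/Department</th><th>Contact</th></tr>
  <tr>
    <td>LAST, FIRST</td>
    <td>TEMP PROF</td>
    <td><span tabindex="0">SCHOOL</span></td>
    <td><a href="mailto:first.last@example.com" class="email"><span aria-hidden="true">Email</span><span class="sr-only">first.last@example.com</span></a><br>555-555-5555</td>
  </tr>
</table>
"""

def parse_row(cells):
    """Turn the four <td> cells of one row into a dict (hypothetical helper)."""
    name, title, _campus, contact = cells
    # Split on the first comma only and strip, so "LAST,FIRST " also works.
    last, first = [part.strip() for part in name.get_text().split(",", 1)]
    link = contact.find("a", class_="email")
    # The href looks like "mailto:address"; keep what follows the colon.
    email = link["href"].split(":", 1)[1] if link else None
    # Assumption: the phone number is the last text fragment in the cell.
    phone = list(contact.stripped_strings)[-1]
    return {
        "last": last,
        "first": first,
        "title": title.get_text(strip=True),
        "email": email,
        "phone": phone,
    }

records = []
for row in BeautifulSoup(html, "html.parser").find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 4:  # skips the header row, which has no <td> cells
        records.append(parse_row(cells))

print(records)
```

Rows whose contact cell has no phone number would need an extra check, since the last text fragment would then be the email address.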

Upvotes: 1
