winnie

Reputation: 135

How to get value from td in BeautifulSoup?

I have a page with some tables in its source:

<td class="ng-binding">9:20 AM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">1:05 PM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">1:15 PM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding">9:20 AM</td>,
 <td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
 <td class="ng-binding" colspan="7">* All times are in local timezone</td>

I would like to get only the times from this page:

9:20 AM
1:05 PM
1:15 PM
9:20 AM

However, my code also prints the status cells and the footnote:

times=soup.find_all('td',{'class':'ng-binding'})
for time in times:
    a = time.text.strip()
    print(a)

--------------------------------------------------------

9:20 AM
Scheduled
1:05 PM
Scheduled
1:15 PM
Scheduled
9:20 AM
Scheduled
* All times are in local timezone

How can I solve this and get my expected output from the page? Thanks.

Upvotes: 0

Views: 391

Answers (3)

seagulf

Reputation: 378

This is a solution using htql:

>>> import htql
>>> results = htql.query(html, "<td (tx =~ '\\d.*')>:tx ")
>>> results
[('9:20 AM',), ('1:05 PM',), ('1:15 PM',), ('9:20 AM',)]

Upvotes: 0

ankurbohra04

Reputation: 462

A more integrated way is to apply the conditions while selecting the tags. This can be done in at least two ways; in both, the tag name passed to find_all is replaced with a function that applies the extra conditions:

  1. Filter out the td tags with a span in them:
def is_td_without_span(tag):
    # Status cells wrap their text in a span; the footnote cell has colspan
    return tag.name == "td" and not tag.find("span") and not tag.has_attr("colspan")

times = soup.find_all(is_td_without_span,{'class':'ng-binding'})
  2. Filter out td tags with non-matching text using a regex:
import re
# Hour can omit the leading 0, minutes cannot; AM|PM is grouped so the
# alternation applies only to the suffix
regex = r"\d{1,2}:\d{2} (?:AM|PM)"
def is_td_with_time(tag):
    return tag.name == "td" and re.search(regex, tag.text) is not None

times = soup.find_all(is_td_with_time,{'class':'ng-binding'})
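
On the regex variant: `|` has the lowest precedence in a regular expression, so whether `AM|PM` is wrapped in a group changes what the pattern matches. The sketch below uses only the standard library to compare the two spellings. The cell texts come from the question's output, except "PM departures", which is a made-up string added only to show the difference.

```python
import re

# Ungrouped: parsed as (\d{1,2}:\d{2} AM) OR (PM)
ungrouped = re.compile(r"\d{1,2}:\d{2} AM|PM")
# Grouped: a time followed by either AM or PM
grouped = re.compile(r"\d{1,2}:\d{2} (?:AM|PM)")

# Sample texts; "PM departures" is hypothetical, the rest are from the question
cells = [
    "9:20 AM",
    "Scheduled",
    "1:05 PM",
    "* All times are in local timezone",
    "PM departures",
]

ungrouped_hits = [c for c in cells if ungrouped.search(c)]
grouped_hits = [c for c in cells if grouped.search(c)]

print(ungrouped_hits)
print(grouped_hits)
```

The ungrouped pattern accepts any text containing a bare "PM", while the grouped one requires the digits as well.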

Upvotes: 1

QHarr

Reputation: 84465

If the HTML is as you show (I suspect you may need to add another anchor of some sort), you can use nth-child(odd) and then filter out the td with colspan:

[i.text for i in soup.select('td:nth-child(odd):not([colspan])')]

Without seeing more HTML, and regarding your follow-up comment: your current list can also be filtered with .endswith (I am unsure how reliable this is given the limited HTML):

[i.text for i in soup.select('td:nth-child(odd):not([colspan])') if i.text.endswith((' AM', ' PM'))]
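
The `.endswith` condition can be tried in isolation, without parsing anything. This is a minimal sketch, assuming the cell texts are exactly the stripped strings printed by the loop in the question; `cell_texts` is a name introduced here for illustration.

```python
# Cell texts as printed by the question's loop (already stripped)
cell_texts = [
    "9:20 AM", "Scheduled", "1:05 PM", "Scheduled",
    "1:15 PM", "Scheduled", "9:20 AM", "Scheduled",
    "* All times are in local timezone",
]

# str.endswith accepts a tuple and is True if any of the suffixes matches
times = [t for t in cell_texts if t.endswith((" AM", " PM"))]
print(times)
```

If the real cells carry trailing whitespace, stripping the text before the check would be needed for the suffix test to match.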

Upvotes: 2
