Reputation: 135
I have a page with some tables in its source:
<td class="ng-binding">9:20 AM</td>,
<td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
<td class="ng-binding">1:05 PM</td>,
<td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
<td class="ng-binding">1:15 PM</td>,
<td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
<td class="ng-binding">9:20 AM</td>,
<td class="ng-binding"><span class="ng-binding" ng-bind-html="objFlight.flight.statusMessage.text | unsafe">Scheduled</span> </td>,
<td class="ng-binding" colspan="7">* All times are in local timezone</td>
I would like to get only the times from this page:
9:20 AM
1:05 PM
1:15 PM
9:20 AM
However, my code:
times = soup.find_all('td', {'class': 'ng-binding'})
for time in times:
    a = time.text.strip()
    print(a)
--------------------------------------------------------
9:20 AM
Scheduled
1:05 PM
Scheduled
1:15 PM
Scheduled
9:20 AM
Scheduled
* All times are in local timezone
How can I solve this and get my expected output from the page? Thanks
Upvotes: 0
Views: 391
Reputation: 378
This is a solution using htql:
>>> import htql
>>> results = htql.query(html, "<td (tx =~ '\\d.*')>:tx ")
>>> results
[('9:20 AM',), ('1:05 PM',), ('1:15 PM',), ('9:20 AM',)]
Upvotes: 0
Reputation: 462
A more integrated approach is to apply the conditions while fetching the tags. This can be done in at least two ways, and in both we replace the tag name passed to find_all with a function that applies the extra conditions:
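# Filter by structure: keep <td> cells that contain no <span> (the status cells do)
def is_td_without_span(tag):
    return tag.name == "td" and not tag.find("span")
# Note: the colspan footer cell has no <span> either, so it is returned as well
times = soup.find_all(is_td_without_span, {'class': 'ng-binding'})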
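import re
# Filter by content: keep <td> cells whose text looks like a time
# (AM|PM must be grouped, otherwise the pattern also matches a bare "PM")
regex = r"\d{1,2}:\d{2} (?:AM|PM)"  # hour can omit leading 0, minutes cannot
def is_td_with_time(tag):
    return tag.name == "td" and re.search(regex, tag.text) is not None
times = soup.find_all(is_td_with_time, {'class': 'ng-binding'})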
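When matching times with a regex, keep in mind that `|` binds more loosely than concatenation, so the `AM|PM` alternative needs its own group. Here is a minimal sketch that uses only the standard `re` module on the cell texts printed in the question. The names `cell_texts`, `TIME_RE`, and `loose` are illustrative and not part of the answer above.

```python
import re

# Cell texts as printed in the question's output.
cell_texts = [
    "9:20 AM", "Scheduled", "1:05 PM", "Scheduled",
    "1:15 PM", "Scheduled", "9:20 AM", "Scheduled",
    "* All times are in local timezone",
]

# Grouped alternation: 1-2 digits, colon, 2 digits, space, then AM or PM.
TIME_RE = re.compile(r"\d{1,2}:\d{2} (?:AM|PM)")

# Ungrouped alternation: a full "h:mm AM" time OR any bare "PM".
loose = re.compile(r"\d{1,2}:\d{2} AM|PM")

# Keep only the strings that contain a time.
times = [text for text in cell_texts if TIME_RE.search(text)]
print(times)

# The ungrouped pattern accepts a string that merely contains "PM".
print(bool(loose.search("PM only")), bool(TIME_RE.search("PM only")))
```

The same compiled pattern could be used inside a tag-filter function by searching `tag.text` instead of a plain string.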
Upvotes: 1
Reputation: 84465
If the HTML is as you show (I suspect you may need to add another anchor of some sort), you can use nth-child(odd) and then filter out the td with colspan:
[i.text for i in soup.select('td:nth-child(odd):not([colspan])')]
Without seeing more HTML, and regarding your follow-up comment, your current list can be filtered in advance with .endswith (I am unsure how reliable this is given the limited HTML):
[i.text for i in soup.select('td:nth-child(odd):not([colspan])') if i.text.endswith((' AM', ' PM'))]
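The suffix filter works as a one-liner because `str.endswith` accepts a tuple of suffixes. Below is a small sketch on plain strings, assuming the cell texts have already been pulled out of the soup. The list `cell_texts` is copied from the question's output and stands in for the `i.text` values.

```python
# Cell texts as they might come out of the <td> elements (untrimmed).
cell_texts = [
    "9:20 AM", "Scheduled ", "1:05 PM", "Scheduled ",
    "1:15 PM", "Scheduled ", "9:20 AM", "Scheduled ",
    "* All times are in local timezone",
]

# endswith() takes a tuple, so one call checks both suffixes.
suffixes = (" AM", " PM")
times = [text.strip() for text in cell_texts if text.strip().endswith(suffixes)]
print(times)
```

This only checks the suffix, so any other cell whose text happens to end in " AM" or " PM" would pass as well.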
Upvotes: 2