Reputation: 1604
I am trying to extract hrefs from the first child of td tags with the class foo. An example DOM is:
<td class="foo">
<a href="www.foobar1.com"></a>
</td>
<td class="foo">
<a href="www.foobar2.com"></a>
</td>
From this I would like to get ["www.foobar1.com", "www.foobar2.com"]
So far I have the following:
import requests
from lxml import html

def get_hrefs(url):
    page = requests.get(url)
    tree = html.fromstring(page.text)
    td_elements = tree.xpath('//td[@class="foo"]')
    return [el.find("a").attrib["href"] for el in td_elements]
However, I feel it would be more efficient to extend the XPath expression instead of iterating over the elements, but I am not sure how to construct it.
Thank you.
Upvotes: 1
Views: 475
Reputation: 473893
Yes, you can simplify it by getting the @href from the a tag inside each td:
return tree.xpath('//td[@class="foo"]/a/@href')
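To try the expression without making a network request, here is a minimal sketch that runs it against the sample markup from the question. The `SAMPLE` string, the surrounding `<table><tr>` wrapper, and the function names `extract_hrefs` and `extract_first_child_hrefs` are assumptions added for illustration. Note that `/a/@href` matches every direct `a` child of the cell; if only a first element child that is an `a` should count, the second expression is one possible way to say that.

```python
from lxml import html

# Sample markup from the question, wrapped in <table><tr> (an assumption)
# so the parser sees the cells in a valid context.
SAMPLE = """
<table><tr>
<td class="foo"><a href="www.foobar1.com"></a></td>
<td class="foo"><a href="www.foobar2.com"></a></td>
</tr></table>
"""

def extract_hrefs(markup):
    """Return the href of every <a> that is a direct child of td.foo."""
    tree = html.fromstring(markup)
    return tree.xpath('//td[@class="foo"]/a/@href')

def extract_first_child_hrefs(markup):
    """Return hrefs only where the first element child of td.foo is an <a>."""
    tree = html.fromstring(markup)
    return tree.xpath('//td[@class="foo"]/*[1][self::a]/@href')

print(extract_hrefs(SAMPLE))
```

Keep in mind that `@class="foo"` is an exact string comparison, so a cell with `class="foo bar"` would not match either expression.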
Upvotes: 1