Reputation: 1
How do I get the href value of the <a> in this snippet of HTML?
I need to select it based on the class of the <i> tag.
<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i> </a>
-->
I tried this, but I am getting no results:
foo_links = tree.xpath('//a[i/@class="foobar"]')
Upvotes: 0
Views: 541
Reputation: 523284
Your code does work for me: it returns a list of <a> elements. If you want a list of hrefs rather than the elements themselves, add /@href:
hrefs = tree.xpath('//a[i/@class="foobar"]/@href')
You could also find the <i>s first, then use /parent::* (or simply /..) to get back to the <a>s.
hrefs = tree.xpath('//a/i[@class="foobar"]/../@href')
# Reading the path left to right:
#   //a/i[@class="foobar"]  find all <i class="foobar"> contained in an <a>
#   /..                     get the parent of the <i>
#   /@href                  obtain the 'href'
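As a quick check, here is a minimal sketch that runs both queries against the snippet from the question with the comment markers removed. The wrapping <div> and the variable names are assumptions made for illustration only.

```python
import lxml.html

# Hypothetical sample: the question's snippet without the <!-- --> markers,
# wrapped in a <div> so that it forms a small standalone fragment.
html = '<div><a href="https://link.com" target="_blank"><i class="foobar"></i> </a></div>'
tree = lxml.html.fromstring(html)

# Filter the <a> elements by their <i> child, then read the attribute.
by_predicate = tree.xpath('//a[i/@class="foobar"]/@href')

# Find the <i> first, step up to its parent, then read the attribute.
by_parent = tree.xpath('//a/i[@class="foobar"]/../@href')

print(by_predicate, by_parent)
```

Both queries should return the same one-element list here, which suggests that an empty result on the real page comes from the document structure rather than from the XPath.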
If none of these work, you may want to check whether the structure of the document is what you expect. Note that XPath won't peek inside comments (<!-- -->). If the <a> is indeed inside a comment, you need to extract the comment's content and parse it yourself first.
import lxml.html

hrefs = [href
         # find all comments
         for comment in tree.xpath('//comment()')
         # parse the content of each comment as a new HTML document,
         # then read the hrefs from it
         for href in lxml.html.fromstring(comment.text)
                              .xpath('//a[i/@class="foobar"]/@href')]
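To see the difference end to end, here is a small sketch under the assumption that the link really is wrapped in a comment. The sample page and the guard that skips comments without markup are my additions; the guard is there because lxml.html.fromstring may raise an error on empty input.

```python
import lxml.html

# Hypothetical page: the link exists only inside an HTML comment.
page = '''<html><body>
<p>visible text</p>
<!--
<a href="https://link.com" target="_blank"><i class="foobar"></i> </a>
-->
</body></html>'''
tree = lxml.html.fromstring(page)

# The direct query finds nothing: the <a> is comment text, not an element.
direct = tree.xpath('//a[i/@class="foobar"]/@href')

hrefs = []
for comment in tree.xpath('//comment()'):
    text = (comment.text or '').strip()
    if '<' not in text:
        continue  # no markup in this comment, so nothing to parse
    # Parse the comment body as its own HTML fragment and query that.
    fragment = lxml.html.fromstring(text)
    hrefs.extend(fragment.xpath('//a[i/@class="foobar"]/@href'))

print(direct, hrefs)
```

The explicit loop does the same work as the comprehension above; it is only spelled out so that comments without markup can be skipped.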
Upvotes: 1
Reputation: 52665
You should note that the target element is inside an HTML comment. You cannot simply get <a> from a comment with an XPath like "//a", because in this case it's not a node but a plain string.
Try the code below:
import re

foo_links = tree.xpath('//comment()')  # get list of all comments on page
href = None
for link in foo_links:
    if '<i class="foobar">' in link.text:
        # get href value from required comment (raw string, literal dot)
        href = re.search(r'\w+://\w+\.\w+', link.text).group(0)
        break
P.S. You might need a more complex regular expression to match the full link URL, since this one stops after the host name.
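One possible stricter pattern, as a sketch only: it captures the whole quoted href of the <a> whose first child is <i class="foobar">, so paths and query strings are kept. The sample comment text, the second link and the attribute order it relies on are assumptions; a regular expression like this is fragile against markup variations, and parsing the comment is generally safer.

```python
import re

# Hypothetical comment text, standing in for link.text from the loop above.
comment_text = '''
<a href="https://other.example/page?x=1"><i class="baz"></i></a>
<a href="https://link.com/some/path?q=1" target="_blank"><i class="foobar"></i> </a>
'''

# Capture the double-quoted href of an <a> tag that is immediately
# followed by <i class="foobar" ...>.
pattern = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>\s*<i\s+class="foobar"')

match = pattern.search(comment_text)
href = match.group(1) if match else None  # None when no such link exists
print(href)
```

Checking the match before calling group() avoids an AttributeError when the comment does not contain the expected link.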
Upvotes: 0