Reputation: 4180
This is the example HTML.
<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>
I need to extract the following values:
Chamber of Secrets
Prisoners in Azkabahn
I am using lxml 4.2.1 in Python, which supports XPath 1.0. I have tried to extract the values using the XPath
'substring-after(//a/@href,"HarryPotter:")'
which returns only "Chamber of Secrets".
and with XPath
'//a/@href[substring-after(.,"HarryPotter:")]'
which returns
'HarryPotter:Chamber of Secrets'
'HarryPotter:Prisoners in Azkabahn'
I have researched this and learned a few things along the way, but I have not found a fix for my problem.
I have tried several different XPath expressions using substring-after by trial and error.
During my research I learned that this could also be done with regex; I tried that as well and failed.
I found that string manipulation with regex is easy in XPath 2.0 and above, but regex can also be used in XPath 1.0 through XSLT extensions.
Can this be done with the substring-after function? If yes, what is the XPath? If not, what is the best approach to get the desired output?
And how can I get the desired output using regex in XPath while sticking to lxml?
Upvotes: 1
Views: 1987
Reputation: 817
If you want to use substring-after() and substring-before() together, here is an example:
from lxml import html
f_html = """<html><body><table><tbody><tr><td class="df9" width="20%">
<a class="nodec1" href="javascript:reqDl(1254);" onmouseout="status='';" onmouseover="return dspSt();">
<u>
2014-2
</u>
</a>
</td></tr></tbody></table></body></html>"""
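tree_html = html.fromstring(f_html)
# List of all matching href attribute values (not used below)
deal_id = tree_html.xpath("//td/a/@href")
# Everything after the prefix
print(tree_html.xpath('substring-after(//td/a/@href, "javascript:reqDl(")'))
# Everything before the first closing parenthesis
print(tree_html.xpath('substring-before(//td/a/@href, ")")'))
# Nest both calls to isolate the value between the prefix and ")"
print(tree_html.xpath('substring-after(substring-before(//td/a/@href, ")"), "javascript:reqDl(")'))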
Result:
1254);
javascript:reqDl(1254
1254
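Note that these expressions only act on the first matching href: in XPath 1.0, a string function that receives a node-set uses the string value of its first node. A minimal sketch for several links, assuming a made-up two-link snippet (not from the original page), is to evaluate the nested expression relative to each element:

```python
from lxml import html

# Hypothetical markup with two links, for illustration only
doc = html.fromstring(
    '<div>'
    '<a href="javascript:reqDl(1254);">2014-2</a>'
    '<a href="javascript:reqDl(1300);">2014-3</a>'
    '</div>'
)

# Relative expression: @href refers to the element it is evaluated on
expr = 'substring-after(substring-before(@href, ")"), "javascript:reqDl(")'

# Evaluate the expression once per <a>, so each link yields its own string
ids = [a.xpath(expr) for a in doc.xpath('//a')]
print(ids)
```

Each call sees exactly one href, so no value is dropped.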
Upvotes: 0
Reputation: 52665
Try this approach to get both text values:
from lxml import html
raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">
text
</a>
<a href="HarryPotter:Prisoners in Azkabahn">
text
</a>
</html>"""
source = html.fromstring(raw_source)
for link in source.xpath('//a'):
    print(link.xpath('substring-after(@href, "HarryPotter:")'))
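The loop is needed because substring-after() converts a node-set argument to the string value of its first node, which is why the original expression returned only "Chamber of Secrets". As for the regex part of the question, here is a rough sketch using the EXSLT regular-expressions namespace that lxml exposes to XPath. It assumes re:test() is used only to select the matching attributes, and the prefix is then removed in plain Python; the names EXSLT_RE and titles are just illustrative.

```python
from lxml import html

raw_source = """<html>
<a href="HarryPotter:Chamber of Secrets">text</a>
<a href="HarryPotter:Prisoners in Azkabahn">text</a>
</html>"""
source = html.fromstring(raw_source)

# EXSLT regular-expressions namespace, bound to the "re" prefix below
EXSLT_RE = "http://exslt.org/regular-expressions"

# Select every href whose value starts with the prefix
hrefs = source.xpath(
    '//a[re:test(@href, "^HarryPotter:")]/@href',
    namespaces={"re": EXSLT_RE},
)

# Strip the prefix in Python: keep everything after the first colon
titles = [href.split(":", 1)[1] for href in hrefs]
print(titles)
```

Doing the string clean-up in Python keeps the XPath simple and sidesteps the first-node conversion entirely.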
Upvotes: 1