brainLoop

Reputation: 4180

How to get a substring from a string using XPath 1.0 in lxml

This is the example HTML.

<html>
  <a href="HarryPotter:Chamber of Secrets">
    text
  </a>
  <a href="HarryPotter:Prisoners in Azkabahn">
    text
  </a>
</html>

I need to extract the text after the "HarryPotter:" prefix from each href:

Chamber of Secrets
Prisoners in Azkabahn 

I am using lxml 4.2.1 in Python, which supports XPath 1.0. I tried the XPath

'substring-after(//a/@href,"HarryPotter:")' 

which returns only "Chamber of Secrets".

and with XPath

'//a/@href[substring-after(.,"HarryPotter:")]' 

which returns

'HarryPotter:Chamber of Secrets'
'HarryPotter:Prisoners in Azkabahn'

I have researched this and learned a few things along the way, but I have not found a fix for my problem. I tried several different XPath expressions using substring-after by trial and error.

During my research I learned that this could also be done with a regex, but my attempts at that failed as well. String manipulation with regex is easy in XPath 2.0 and above, and regex can also be used in XPath 1.0 through the EXSLT extensions.

Can this be done with the substring-after function? If yes, what is the XPath? If not, what is the best approach to get the desired output?

And how can I get the desired output using regex in XPath while sticking to lxml?

Upvotes: 1

Views: 1987

Answers (2)

drt

Reputation: 817

If you want to use substring-after() and substring-before() together, here is an example:

from lxml import html

f_html = """<html><body><table><tbody><tr><td class="df9" width="20%">
         <a class="nodec1" href="javascript:reqDl(1254);" onmouseout="status='';" onmouseover="return dspSt();">
          <u>
           2014-2
          </u>
         </a>
        </td></tr></tbody></table></body></html>"""
tree_html = html.fromstring(f_html)
deal_id = tree_html.xpath("//td/a/@href")
print(tree_html.xpath('substring-after(//td/a/@href, "javascript:reqDl(")'))
print(tree_html.xpath('substring-before(//td/a/@href, ")")'))
print(tree_html.xpath('substring-after(substring-before(//td/a/@href, ")"), "javascript:reqDl(")'))

Result:

1254);
javascript:reqDl(1254
1254

Upvotes: 0

Andersson

Reputation: 52665

Try this approach to get both text values:

from lxml import html

raw_source = """<html>
  <a href="HarryPotter:Chamber of Secrets">
    text
  </a>
  <a href="HarryPotter:Prisoners in Azkabahn">
    text
  </a>
</html>"""
source = html.fromstring(raw_source)

for link in source.xpath('//a'):
    print(link.xpath('substring-after(@href, "HarryPotter:")'))
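Why the loop is needed: in XPath 1.0, a string function that receives a node-set uses only the string-value of the first node in document order, so substring-after(//a/@href, ...) can only ever return one value. A predicate such as [substring-after(., "HarryPotter:")] only filters nodes (a non-empty string counts as true); it does not change what is returned. The sketch below collects the per-link results into a list and also shows a regex variant. It assumes lxml's EXSLT regular-expressions support with the argument order (string, pattern, flags, replacement) for re:replace; the names links, titles, titles_re and ns are illustrative only.

```python
from lxml import html

raw_source = """<html>
  <a href="HarryPotter:Chamber of Secrets">text</a>
  <a href="HarryPotter:Prisoners in Azkabahn">text</a>
</html>"""
source = html.fromstring(raw_source)
links = source.xpath("//a")

# Evaluate substring-after() once per <a>, so @href is a single attribute each time.
titles = [link.xpath('substring-after(@href, "HarryPotter:")') for link in links]

# Regex variant through the EXSLT regular-expressions namespace.
# Assumed argument order: (string, pattern, flags, replacement).
ns = {"re": "http://exslt.org/regular-expressions"}
titles_re = [
    link.xpath('re:replace(@href, "^HarryPotter:", "", "")', namespaces=ns)
    for link in links
]

print(titles)
print(titles_re)
```

Both lists should hold one entry per link. For a fixed prefix like this one, substring-after is the simpler choice; the regex form is mainly useful when the prefix varies.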

Upvotes: 1
