Regex in lxml.xpath

Question

I am trying to create a function which returns the names from a website's "Our Team" page using an XPath. Most of the time I can build the XPath from class names and grab all the names in one go. In some cases, however, classes are used but they differ from person to person. For example, here are the XPaths for 2 people on the same page:

//html/body/div[contains(@class,"el13")]/div[contains(@class,"el22")]/div[contains(@class, "el23")]/text()
//html/body/div[contains(@class,"el3")]/div[contains(@class,"el34")]/div[contains(@class, "el77")]/text()

Is there a way to give tree.xpath a single XPath that contains a regex? For instance, \d+ means one or more digits. Could tree.xpath then grab all the names into a list as usual, with something like this?

//html/body/div[contains(@class,"el\d+")]/div[contains(@class,"el\d+")]/div[contains(@class, "el\d+")]/text()

I read in the documentation that the lxml library supports the EXSLT regex library, however I am not familiar with how I could use it in the way described above. I also use the regular regex library a lot in other parts of my code, so importing another one could mess things up (at least as far as I understand). More info on it here: https://lxml.de/xpathxslt.html

This is the part of my code which does this currently:

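import requests
from lxml import html

response = requests.get("url of the page")
if response.status_code == 200:
    # html.fromstring() expects markup (str or bytes), not the Response object
    tree = html.fromstring(response.content)
    names = tree.xpath("the xpath to the names")

    # names is a list such as ["John Smith", "Jane Smith", "Harry Cobbler"]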

Alexandra Dudkina · Accepted Answer

from lxml import etree as et

# xml holds the document source (str or bytes)
tree = et.fromstring(xml)

# EXSLT regular-expressions namespace
reNS = "http://exslt.org/regular-expressions"
# compile the XPath; the raw string keeps \d intact for the regex engine
find = et.XPath(r"//div[re:test(@class, '^el\d+$', 'i')]", namespaces={'re': reNS})
# evaluate the compiled XPath against the tree
names = find(tree)

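The XPath predicate [re:test(@class, '^el\d+$', 'i')] uses the EXSLT test function. The first argument is the class attribute, the second is the regular expression, and the third is the i flag for case-insensitive matching. Because the pattern is anchored with ^ and $, it matches only when the whole attribute value is "el" followed by digits.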
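To see how the anchors and the i flag change what re:test matches, here is a small sketch. The document, the class values, and the helper name matching_text are made up for illustration; only re:test and the EXSLT namespace come from lxml.

```python
from lxml import etree

RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample document; the class values are invented for the demo.
doc = etree.fromstring(
    "<root>"
    '<div class="el13">a</div>'
    '<div class="EL7">b</div>'
    '<div class="el13 card">c</div>'
    '<div class="label5">d</div>'
    "</root>"
)


def matching_text(predicate):
    """Return the text of every div that satisfies the given XPath predicate."""
    find = etree.XPath(f"//div[{predicate}]/text()", namespaces=RE_NS)
    return find(doc)


# Anchored and case-insensitive: the whole attribute must be "el" plus digits.
anchored = matching_text(r"re:test(@class, '^el\d+$', 'i')")

# Unanchored: matches anywhere in the attribute, including inside "label5".
unanchored = matching_text(r"re:test(@class, 'el\d+')")

# Whole-token match: "el" plus digits as one space-separated class name.
token = matching_text(r"re:test(@class, '(^|\s)el\d+(\s|$)')")

print(anchored, unanchored, token)
```

If the divs on your page carry several classes each, the whole-token pattern is probably the closest equivalent to what contains(@class, ...) was meant to do.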

Your XPath will then look like this (the re prefix still has to be bound, so pass the same namespaces mapping to tree.xpath):

//html/body/div[re:test(@class,"el\d+", "i")]/div[re:test(@class,"el\d+", "i")]/div[re:test(@class,"el\d+", "i")]/text()
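Put together with tree.xpath from the question, it could look roughly like this. The PAGE markup is a made-up stand-in for the real page, and the sketch assumes each div carries a single class, so the anchored pattern is used; adjust the pattern if your markup differs.

```python
from lxml import html

# Hypothetical markup standing in for the real "Our Team" page.
PAGE = """
<html><body>
  <div class="el13"><div class="el22"><div class="el23">John Smith</div></div></div>
  <div class="el3"><div class="el34"><div class="el77">Jane Smith</div></div></div>
  <div class="footer"><div class="el1"><div class="el2">Not a name</div></div></div>
</body></html>
"""

# Bind the "re" prefix to the EXSLT regular-expressions namespace.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

tree = html.fromstring(PAGE)

# One path step reused three times; the raw string keeps \d intact.
step = r'div[re:test(@class, "^el\d+$")]'
names = tree.xpath(f"/html/body/{step}/{step}/{step}/text()", namespaces=RE_NS)

print(names)
```

This does not touch Python's own re module in your code: the re here is only a namespace prefix inside the XPath string, so nothing extra has to be imported.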
