Arhama

Reputation: 75

XPath to the children as well as "text children"

<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"           
</li>

I'm scraping an online dictionary, and the main dictionary part has roughly the above structure.

How exactly do I get all those children? With the usual Selenium approach I see online, list_elem.find_elements(By.XPATH, ".//*"), I get only the "proper" (element) children, but not the textual ones (sorry if my word choice is off). That is, I would like to have len(children) == 6 instead of len(children) == 4.

I would like to get all the children for further analysis.

Upvotes: 1

Views: 132

Answers (3)

Mads Hansen

Reputation: 66781

Elements *, comment(), text(), and processing-instruction() are all nodes.

To select all nodes:

.//node()

To ensure that it's only selecting * and text() you can add a predicate filter:

.//node()[self::* or self::text()]

However, the Selenium methods are find_element() and find_elements(), and they expect to locate elements, not text() nodes. There doesn't seem to be a more generic method for finding nodes, so you may need to write some code to achieve what you want, as in JaSON's answer.

Upvotes: 1

JaSON

Reputation: 4869

If you want the text of every child node of the li node (direct text nodes as well as child elements, whose text includes that of their descendants), you can try this code:

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

driver = webdriver.Chrome()
driver.get(<URL>)

li = driver.find_element('xpath', '//li')
nodes = driver.execute_script("return arguments[0].childNodes", li)
text_nodes = []
for node in nodes:
    if not isinstance(node, WebElement):  # Extract text from direct child text nodes
        _text = node['textContent'].strip()
        if _text:  # Ignore all the empty text nodes
            text_nodes.append(_text)
    else:  # Extract text from WebElements like <b>, <i>... 
        text_nodes.append(node.text)

print(text_nodes)

Output:

['word', 'type', '1.', '"translation 1"', '2.', '"translation 2"']
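The same child-node walk can be tried without a browser. Below is a minimal sketch using the standard library's xml.dom.minidom on the snippet from the question. It assumes the markup is well-formed XML, which real pages often are not, and the names SNIPPET, text_of and child_texts are made up for illustration.

```python
from xml.dom import minidom

# The <li> from the question, assumed here to be well-formed XML.
SNIPPET = """<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"
</li>"""


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def child_texts(element):
    """Return the stripped text of each direct child, skipping blank text nodes."""
    texts = []
    for node in element.childNodes:
        # Keep only element and text nodes (comments etc. are ignored).
        if node.nodeType not in (node.ELEMENT_NODE, node.TEXT_NODE):
            continue
        text = text_of(node).strip()
        if text:
            texts.append(text)
    return texts


li = minidom.parseString(SNIPPET).documentElement
print(child_texts(li))
```

As in the Selenium version, the whitespace-only text nodes between the tags are dropped, so six entries remain.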

Upvotes: 2

Conal Tuohy

Reputation: 3258

I'm not a Selenium expert but I've read StackOverflow answers where apparently knowledgeable people have asserted that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.

So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't, because although it's a valid XPath query, it returns text nodes rather than elements.

I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
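As a rough sketch of what that could look like, the following parses the snippet from the question with lxml.etree and selects the direct children that are either elements or non-blank text nodes. It assumes the fragment is well-formed XML; for a live page you would more likely feed Selenium's driver.page_source to lxml.html.fromstring. The names SNIPPET, children and texts are only illustrative.

```python
from lxml import etree

# The <li> from the question, assumed here to be well-formed XML.
SNIPPET = """<li>
    <b>word</b>
    <i>type</i>
    <b>1.</b>
    "translation 1"
    <b>2.</b>
    "translation 2"
</li>"""

li = etree.fromstring(SNIPPET)

# Direct children that are elements, or text nodes with non-whitespace content.
children = li.xpath("./node()[self::* or self::text()[normalize-space()]]")

# lxml returns text nodes as str subclasses and elements as Element objects.
texts = [
    node.strip() if isinstance(node, str) else (node.text or "").strip()
    for node in children
]

print(len(li.xpath(".//*")))  # element descendants only
print(len(children))          # elements plus non-blank text nodes
print(texts)
```

The normalize-space() predicate filters out the whitespace-only text nodes that sit between the tags; without it, ./node() would also return those.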

Upvotes: 1
