Reputation: 9548

How can I get text of an element in Selenium WebDriver, without including child element text?

Consider:

<div id="a">This is some
   <div id="b">text</div>
</div>

Getting "This is some" is nontrivial. For instance, this returns "This is some text":

driver.find_element_by_id('a').text

How does one, in a general way, get the text of a specific element without including the text of its children?

Upvotes: 60

Answers (5)

undetected Selenium

Reputation: 193308

In the HTML which you have shared:

<div id="a">This is some
   <div id="b">text</div>
</div>

The text This is some is within a text node. To depict the text node in a structured way:

<div id="a">
    This is some
   <div id="b">text</div>
</div>

This use case

To extract and print the text This is some from the text node using Selenium's Python client, there are two approaches:

Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:

using xpath:

print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])

using css_selector:

print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])

Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:

using xpath and firstChild:

parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())

using xpath and childNodes[n] (for this markup the text node is childNodes[0]; childNodes[1] is the inner <div id="b">):

parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
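The splitlines() approach depends on how the markup is laid out: it works here only because the text node and the child element sit on separate lines of the source. A small sketch on plain strings shows the difference. The innerHTML values are assumed stand-ins for what the browser would return, and first_line is a hypothetical helper, not part of Selenium:

```python
def first_line(inner_html):
    """Return the first line of an innerHTML string, stripped of whitespace."""
    lines = inner_html.splitlines()
    return lines[0].strip() if lines else ""


# Assumed innerHTML of <div id="a"> from the question: text and child on separate lines.
multi_line = 'This is some\n   <div id="b">text</div>\n'
print(first_line(multi_line))

# Same content on a single line: the child element's markup leaks into the result.
single_line = 'This is some <div id="b">text</div>'
print(first_line(single_line))
```

If the page may be minified or reformatted, the execute_script() variants are likely the safer choice because they address the text node itself rather than a line of source.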

Upvotes: 15

Pikamander2

Reputation: 8319

Unfortunately, Selenium was only built to work with Elements, not Text nodes.

If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.

One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.

import bs4
from bs4 import BeautifulSoup

inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')

outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')

From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.

Here's a simple one-liner that may be sufficient:

inner_soup.find(text=True)

If that doesn't work, then you can loop through the element's child nodes with the .contents attribute and check their object type.

Beautiful Soup has four types of elements, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.

contents = inner_soup.contents

for bs4_object in contents:

    if (type(bs4_object) == bs4.Tag):
        print("This object is an Element.")

    elif (type(bs4_object) == bs4.NavigableString):
        print("This object is a Text node.")

Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.
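If adding Beautiful Soup as a dependency is not an option, the same idea (parse the innerHTML and keep only the top-level text) can be sketched with Python's standard-library html.parser instead. This is a swapped-in technique, not the Beautiful Soup API; DirectTextCollector and direct_text are hypothetical names, and the sample string is an assumed innerHTML for the question's <div id="a">:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DirectTextCollector(HTMLParser):
    """Collect text that sits at the top level of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of currently open child elements
        self.parts = []   # text chunks seen while no child element is open

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def direct_text(inner_html):
    """Return the fragment's top-level text, excluding text inside child elements."""
    collector = DirectTextCollector()
    collector.feed(inner_html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


print(direct_text('This is some\n   <div id="b">text</div>\n'))
```

With Selenium, the argument would be the string returned by get_attribute("innerHTML") on the parent element.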

Upvotes: 3

josh

Reputation: 9548

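Remove each direct child's text from the parent's text:

def get_true_text(tag):
    """Return tag's text with each direct child's text removed once."""
    children = tag.find_elements_by_xpath('*')  # direct child elements only
    original_text = tag.text
    for child in children:
        # Drop the first occurrence of this child's rendered text.
        original_text = original_text.replace(child.text, '', 1)
    return original_text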

Upvotes: 6

Louis

Reputation: 151511

Here's a general solution:

def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)

The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).

Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:

return driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while(child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element)

I'm actually using this code in a test suite.
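For reference, the jQuery-free variant assembled into a complete function might look like the sketch below. It assumes only that driver exposes Selenium's execute_script(script, *args) method and that element is a WebElement:

```python
def get_text_excluding_children(driver, element):
    """Concatenate the text of element's direct text-node children."""
    # The JavaScript walks the direct children and keeps text nodes only.
    return driver.execute_script("""
    var parent = arguments[0];
    var child = parent.firstChild;
    var ret = "";
    while (child) {
        if (child.nodeType === Node.TEXT_NODE)
            ret += child.textContent;
        child = child.nextSibling;
    }
    return ret;
    """, element)
```

The returned string keeps the whitespace of the text nodes as they appear in the DOM, so callers will usually want to apply .strip() to the result.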

Upvotes: 30

kreativitea

Reputation: 1791

You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
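A minimal sketch of the slicing idea, written against plain strings so it can be tried without a browser. strip_children_text is a hypothetical helper, and it assumes the element's own text comes first with the child text at the end; separators that the browser inserts between several children are not counted, so it is only reliable when a single child follows the text:

```python
def strip_children_text(parent_text, child_texts):
    """Slice the children's combined text length off the end of parent_text.

    Assumes the element's own text precedes all child text.
    """
    children_length = sum(len(text) for text in child_texts)
    own_length = len(parent_text) - children_length
    return parent_text[:own_length].strip()


# Values taken from the question: the parent's .text and its one child's .text.
print(strip_children_text("This is some text", ["text"]))
```

With Selenium, parent_text would be tag.text and child_texts the .text of each element returned by tag.find_elements_by_xpath('*').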

Upvotes: 4
