Reputation: 9538
Consider:
<div id="a">This is some
<div id="b">text</div>
</div>
Getting "This is some" is nontrivial. For instance, this returns "This is some text":
driver.find_element_by_id('a').text
How does one, in a general way, get the text of a specific element without including the text of its children?
Upvotes: 60
Views: 166218
Reputation: 193058
In the HTML which you have shared:
<div id="a">This is some
<div id="b">text</div>
</div>
The text "This is some" is within a text node. To depict the text node in a structured way:
<div id="a">
This is some
<div id="b">text</div>
</div>
To extract and print the text "This is some" from the text node using Selenium's Python client, you have two options:
Using splitlines(): You can identify the parent element, i.e. <div id="a">, extract its innerHTML, and then use splitlines() as follows:
using xpath:
print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])
using css_selector:
print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])
Using execute_script(): You can also use the execute_script() method, which synchronously executes JavaScript in the current window/frame, as follows:
using xpath and firstChild:
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())
using xpath and childNodes[n] (in this markup the text node is the first child, so n is 0):
parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
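One thing worth checking before relying on the splitlines() approach: it only works because the markup happens to have a line break between the text and the child element. Here is a small sketch that runs on plain strings instead of a live browser. The helper name first_line_of_inner_html and the minified variant are made up for illustration.

```python
def first_line_of_inner_html(inner_html: str) -> str:
    """Return the first line of an innerHTML string, stripped of whitespace."""
    lines = inner_html.splitlines()
    return lines[0].strip() if lines else ""


# innerHTML of <div id="a"> as posted in the question
as_posted = 'This is some\n<div id="b">text</div>\n'
# Hypothetical variant: the same markup served on a single line
minified = 'This is some<div id="b">text</div>'

print(first_line_of_inner_html(as_posted))  # This is some
print(first_line_of_inner_html(minified))   # This is some<div id="b">text</div>
```

If the page might be minified, the execute_script() variants are probably the safer choice, since they look at DOM nodes and not at line breaks.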
Upvotes: 15
Reputation: 8299
Unfortunately, Selenium was only built to work with Elements, not Text nodes.
If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.
One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like Beautiful Soup that can handle text nodes more elegantly.
import bs4
from bs4 import BeautifulSoup
inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')
outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')
From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.
Here's a simple one-liner that may be sufficient:
inner_soup.find(text=True)
If that doesn't work, then you can loop through the element's child nodes with .contents (an attribute, not a method) and check their object type.
Beautiful Soup has four kinds of objects, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.
contents = inner_soup.contents
for bs4_object in contents:
    if type(bs4_object) == bs4.Tag:
        print("This object is an Element.")
    elif type(bs4_object) == bs4.NavigableString:
        print("This object is a Text node.")
Note that Beautiful Soup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this question.
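To turn the type check above into the actual text, one option is to join only the top-level NavigableString objects. This is a minimal sketch, assuming bs4 is installed and that the string passed in is the innerHTML of the parent element. The function name direct_text is made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def direct_text(inner_html: str) -> str:
    """Join the top-level text nodes of an innerHTML string, skipping child tags."""
    soup = BeautifulSoup(inner_html, "html.parser")
    parts = [
        node
        for node in soup.contents
        # Comment is a subclass of NavigableString, so exclude it explicitly
        if isinstance(node, NavigableString) and not isinstance(node, Comment)
    ]
    return "".join(parts).strip()


# Stand-in for driver.find_element(...).get_attribute("innerHTML")
inner_html = 'This is some\n<div id="b">text</div>\n'
print(direct_text(inner_html))  # This is some
```

Because the soup is built from the innerHTML, its top-level contents correspond to the direct children of the original element, so nested text is left out.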
Upvotes: 3
Reputation: 9538
Use:
def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text
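To see what the replace loop does without a browser, here is the same logic on plain strings. It is only a sketch: remove_child_texts is a made-up name, and the sample strings assume that .text joins the parent's text and the child's text with a newline.

```python
def remove_child_texts(full_text, child_texts):
    """Remove the first occurrence of each child's text from the parent's text."""
    remaining = full_text
    for child_text in child_texts:
        remaining = remaining.replace(child_text, "", 1)
    return remaining


# The case from the question: the parent's own text, then the child's text
print(repr(remove_child_texts("This is some\ntext", ["text"])))  # 'This is some\n'

# Caveat: if the parent's own text contains the child's text earlier on,
# the first occurrence that gets removed is the wrong one
print(repr(remove_child_texts("text here\ntext", ["text"])))  # ' here\ntext'
```

So the result may need a strip(), and it can be off when the child's text also appears in the parent's own text.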
Upvotes: 6
Reputation: 151380
Here's a general solution:
def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)
The element passed to the function can be something obtained from the find_element...() methods (i.e., it can be a WebElement object).
Or if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:
    return driver.execute_script("""
    var parent = arguments[0];
    var child = parent.firstChild;
    var ret = "";
    while (child) {
        if (child.nodeType === Node.TEXT_NODE)
            ret += child.textContent;
        child = child.nextSibling;
    }
    return ret;
    """, element)
I'm actually using this code in a test suite.
Upvotes: 30
Reputation: 1791
You don't have to do a replace. You can get the length of the children text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
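A rough sketch of that idea on plain strings, under the assumption that the element's own text comes before all of its children's text. The name own_text_by_slicing is made up, and the inputs stand in for element.text and the .text of each child.

```python
def own_text_by_slicing(full_text, child_texts):
    """Drop the children's text from the end of the parent's text by length."""
    children_length = sum(len(text) for text in child_texts)
    # Slice with an explicit end index so that zero children keeps the whole text
    return full_text[: len(full_text) - children_length]


full_text = "This is some\ntext"  # stand-in for the parent's .text
child_texts = ["text"]            # stand-in for each child's .text

print(repr(own_text_by_slicing(full_text, child_texts)))  # 'This is some\n'
print(own_text_by_slicing(full_text, child_texts).strip())  # This is some
```

This does not account for separators between several children or for text that follows a child, so it is only safe for layouts like the one in the question.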
Upvotes: 4