Bishoy

Reputation: 725

Selecting parent element text only using Selenium

I am using a Python script to extract information from a website with the Selenium library. Using a selector, I got a WebElement object for the target element, which looks something like the following:

<myTargetElement><strong>324. </strong>Some interesting content that might contain numbers 323 or dots ...,;</myTargetElement>

I want to extract two pieces of information separately:

The ID wrapped in the strong tag, which I extract as follows:

myTargetElementObject.find_element_by_tag_name('strong').text.strip(' .')

Now I am puzzled about how to extract the other part. If I use myTargetElementObject.text, the returned text includes the ID as well.

The data I am extracting is very large, and I am cautious about using regex. Is there a way to use the WebElement object to return the element's own text without the text of its sub-elements?

Upvotes: 1

Views: 1056

Answers (1)

alecxe

Reputation: 474171

I would get the complete text of the target element and split it on the first ". " (a dot followed by a space):

strong, rest_of_the_content = myTargetElementObject.text.split(". ", 1)
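If some elements might lack the ". " delimiter, the unpacking above would raise a ValueError. A minimal sketch of a more forgiving variant, using str.partition, could look like this; the helper name split_id_and_content is hypothetical, and the sample string stands in for myTargetElementObject.text:

```python
def split_id_and_content(text, delimiter=". "):
    """Split 'ID. content' on the first occurrence of the delimiter.

    Returns (id, content). If the delimiter is missing, the id is None
    and the whole text is returned as content.
    """
    head, sep, tail = text.partition(delimiter)
    if not sep:
        # Delimiter not found: treat everything as content
        return None, text
    return head.strip(), tail


# Stand-in for myTargetElementObject.text (assumed value)
sample = "324. Some interesting content that might contain numbers 323 or dots ...,;"
item_id, content = split_id_and_content(sample)
print(item_id)
print(content)
```

Because only the first delimiter is used, later dots or numbers in the content are left untouched.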

In general, though, the task is not that easy (here you have a clear delimiter): you cannot target and get text nodes directly in Selenium, so expressions like following-sibling::text() are not usable as locators. A common approach is to get the parent's text and the child's text, then remove the child's text from the parent's.
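The "subtract the child text" idea can be sketched roughly as follows. The helper name own_text is hypothetical, and the function assumes only that the element exposes .text and .find_elements(by, value) the way a Selenium WebElement does; the literal "xpath" is the value of By.XPATH, used here so the snippet does not need a Selenium import:

```python
def own_text(element):
    """Return the element's visible text minus the text of its direct children.

    Assumes a WebElement-like object with .text and .find_elements(by, value).
    """
    text = element.text
    # "./*" selects the direct child elements of the current element
    for child in element.find_elements("xpath", "./*"):
        child_text = child.text
        if child_text:
            # Remove only the first occurrence of this child's text
            text = text.replace(child_text, "", 1)
    return text.strip()
```

With the element from the question, own_text(myTargetElementObject) should leave only the content after the strong tag. Note the limitation: if a child's text also appears earlier in the parent's own text, the wrong occurrence would be removed, so this is a heuristic rather than a true text-node lookup.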


Another possible approach would involve some separate HTML parsing with BeautifulSoup where you can go sideways and access text nodes:

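from bs4 import BeautifulSoup

# Parse only the target element's markup, not the whole page
html = myTargetElementObject.get_attribute("outerHTML")
soup = BeautifulSoup(html, "html.parser")
label = soup.strong  # the first <strong> tag
text_after = label.next_sibling  # the text node right after </strong>

print(label.get_text(), text_after)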

Upvotes: 2
