Reputation: 725
I am using a Python script to extract information from a website with the Selenium library. Using a selector, I got a WebElement object for the target element, which looks something like the following:
<myTargetElement><strong>324. </strong>Some interesting content that might contain numbers 323 or dots ...,;</myTargetElement>
I want to extract two pieces of information separately. The first is the ID wrapped in the strong tag, which I get as follows:
myTargetElementObject.find_element_by_tag_name('strong').text.strip(' .')
Now I am puzzled about how to extract the other part. If I use myTargetElementObject.text, it returns the ID together with the rest of the text.
The data I am extracting is very large, so I am cautious about using regex. Is there a way to have the WebElement object return the text of the element without the text of its sub-elements?
Upvotes: 1
Views: 1056
Reputation: 474171
I would get the complete text of the target element and split it on the first ". ":
strong, rest_of_the_content = myTargetElementObject.text.split(". ", 1)
In general, though, the task is not that easy (here you have a clear delimiter): you cannot target and get text nodes directly in Selenium, so XPath expressions like following-sibling::text() will not return an element. A common approach is to get the child text and the parent text, then remove the child's text from the parent's.
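The "remove the child text from the parent text" idea can be sketched as a small string helper. This is only an illustration: the helper name text_without_leading_child is made up here, and it assumes the child element's text appears once at the start of the parent's text, as in the example markup from the question.

```python
def text_without_leading_child(parent_text, child_text):
    """Return parent_text with child_text removed (hypothetical helper).

    Assumes the child's text normally sits at the start of the parent's text.
    """
    if parent_text.startswith(child_text):
        # Drop the leading child text and the whitespace that follows it.
        return parent_text[len(child_text):].lstrip()
    # Fallback: remove only the first occurrence, wherever it is.
    return parent_text.replace(child_text, "", 1).strip()


# Plain strings stand in for WebElement.text values here.
print(text_without_leading_child("324. Some content, maybe 323 or dots ...", "324."))
```

With Selenium objects, the two arguments would be myTargetElementObject.text and the .text of the strong element found with the same lookup the question already uses. Later numbers or dots in the content are left untouched because only the leading match is removed.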
Another possible approach would involve some separate HTML parsing with BeautifulSoup, where you can go sideways and access text nodes:
from bs4 import BeautifulSoup
# Parse only the target element's own markup
html = myTargetElementObject.get_attribute("outerHTML")
soup = BeautifulSoup(html, "html.parser")
label = soup.strong
# The text node that immediately follows the strong tag
text_after = label.next_sibling
print(label.get_text(), text_after)
Upvotes: 2