Reputation: 81
I want to extract elements from various webpages using the Selenium driver package. I identify target elements by their text, using find_elements_by_xpath. Although I thought I had solved the issues with whitespace, line breaks, etc., the following element is unfortunately NOT found by my code.
This is the element that I am trying to find by using its text:
x = """<p align="left"><font face="Arial" color="#439539" size="5">Compensation
Discussion<br>& Analysis</font></p>"""
This is a screenshot of the original code of the respective webpage.
This is the code that I am currently using to identify elements that contain the text "Compensation Discussion & Analysis":
searchterm = "Compensation Discussion & Analysis"
driver.find_elements_by_xpath("//*[contains(normalize-space(translate(., '\u00A0', ' ')), '" + searchterm + "')]")
I know that there might be ways to match only parts of my search term, such as starts-with() and the like. However, I would strongly prefer to keep searching for the entire search term without splitting it into its components.
Any help is highly appreciated! Thanks a lot in advance!
Upvotes: 1
Views: 259
Reputation: 3753
What you have looks good, and I would expect normalize-space() to work. However, that <br> in the middle is clearly an interesting one.
What I can tell you is that the br is causing the text to be split into 2 nodes. You actually have text() and text()[2].
I've only tried this in Chrome and have not attempted it in Selenium yet, but try this XPath:
//font[contains(normalize-space(concat(text(), ' ', text()[2])),'Compensation Discussion & Analysis')]
(Note that I matched this to font, but you can update as needed.)
This matches your troublesome object and others by full text, which I think is what you're after.
This is how my devtools looks:
It could also be useful to know that you can add additional items to the concat, even if they don't exist, and still retain the matches:
//font[contains(normalize-space(concat(text(), ' ', text()[2], ' ', text()[3])),'Compensation Discussion & Analysis')]
That might mean one identifier to match them all.
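As a rough sketch of that "one identifier" idea, the XPath can be generated from the search term instead of typed by hand. The helper name build_text_xpath and its parameters are hypothetical, not part of Selenium; it assumes the search term contains no single quote, and it writes text()[1] where the expressions above write text(), which selects the same first node inside concat().

```python
def build_text_xpath(searchterm, max_text_nodes=3, tag="font"):
    """Build an XPath that joins the first N text nodes with spaces.

    Hypothetical helper for illustration only. XPath 1.0 concat() needs
    at least two arguments, so max_text_nodes must be 2 or more.
    """
    if max_text_nodes < 2:
        raise ValueError("max_text_nodes must be at least 2")
    if "'" in searchterm:
        # Assumption: keep quoting simple and reject single quotes.
        raise ValueError("search term must not contain a single quote")

    # text()[1], ' ', text()[2], ' ', text()[3], ...
    nodes = [f"text()[{i}]" for i in range(1, max_text_nodes + 1)]
    joined = ", ' ', ".join(nodes)
    return (
        f"//{tag}[contains(normalize-space(concat({joined})), "
        f"'{searchterm}')]"
    )


xpath = build_text_xpath("Compensation Discussion & Analysis")
print(xpath)
```

The resulting string can be passed to driver.find_elements_by_xpath(xpath) as in the question. In Selenium 4 that method was removed, and the equivalent call is driver.find_elements(By.XPATH, xpath).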
Final comment: you can see that in the middle I join the two nodes WITH A SPACE, concat(text(), ' ', text()[2]). This is because the text of the nodes is Compensation Discussion↵& Analysis, so there is no space between "Discussion" and "&". Adding this space increases consistency with the rest of the document.
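The effect can be reproduced without a browser. The sketch below uses Python's standard html.parser to collect the text nodes of the snippet from the question, then compares joining them with nothing in between (what the element's string value gives you) against joining them with a space. The normalize_space function is only an approximation of the XPath function, and TextNodeCollector is a name made up for this illustration.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects each run of text between tags as a separate 'text node'."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def normalize_space(s):
    # Rough stand-in for XPath normalize-space(): trim and collapse whitespace.
    return " ".join(s.split())


snippet = (
    '<p align="left"><font face="Arial" color="#439539" size="5">Compensation\n'
    "Discussion<br>& Analysis</font></p>"
)
searchterm = "Compensation Discussion & Analysis"

collector = TextNodeCollector()
collector.feed(snippet)
collector.close()

# The <br> contributes no text, so the two nodes are simply adjacent.
string_value = normalize_space("".join(collector.nodes))
# Joining with an explicit space mirrors concat(text(), ' ', text()[2]).
joined_with_space = normalize_space(" ".join(collector.nodes))

print(collector.nodes)
print(searchterm in string_value)
print(searchterm in joined_with_space)
```

The first check fails because the normalized string value reads "Discussion& Analysis"; the second succeeds once the space is inserted between the nodes.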
[update]
After all the above (which works!) I thought about that "final comment" again...
I looked again, and normalize-space does work. Your text just doesn't have a space before the ampersand...
Upvotes: 2
Reputation: 62
Try this if you are looking for the entire search term on the page:
string=driver.find_element_by_xpath("//div[19]/table[1]/tbody[1]/tr[20]/td[1]/font[1]")
print(string.text)
OR
print(string.get_attribute("innerHTML"))
This should do the job!
Upvotes: -1