Reputation: 15
Here's the HTML of the page I'm trying to parse (it's a bookstore). Part of the page code:
<td width="300" class="highlight">
Додо Пресс,Фантом Пресс
I need to get the text that follows the label 'Издатель:' (translation: "Publisher").
First I used next_sibling from BeautifulSoup. It worked fine, but on other books' pages on the same site the publisher element isn't always in the same place, so my chain of next siblings doesn't reach the right part of the book description.
I tried to locate the exact text 'Издатель:' with Selenium:
pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
and it did the job: I got the element with the text 'Издатель:'. After that I tried to locate the next element following 'Издатель:', because the text I need always comes right after it.
Selenium's element locators don't work for this, because the publisher's name doesn't have its own tag, class, etc.
I also tried running JS:
pubs = driver.find_element(By.XPATH, "//*[text()='Издатель:']")
pub = driver.execute_script("""
return arguments[0].nextElement""", pubs)
pub = driver.execute_script("return document.evaluate('// [text()='Издатель:']/following-sibling::text()[1]'), document, null, XPathResult.FIRST_ORDERED_NODE_TYPE,null).singleNodeValue.textContent;")
That didn't work either.
The publisher element doesn't have any sibling or child element, so I don't know how to get the text following it.
Site URL -
Upvotes: 0
Views: 194
Reputation: 193108
The text Додо Пресс,Фантом Пресс is within a text node, so you have to use execute_script().
Induce WebDriverWait for element_to_be_clickable() to click the a.collapsed link, then wait for the td that contains 'Издатель:' to be visible and read the textContent of its last child:
Code Block:
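from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.collapsed"))).click()
print(driver.execute_script('return arguments[0].lastChild.textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//*[text()='Издатель:']//ancestor::td[1]")))).strip())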
Console Output:
Додо Пресс,Фантом Пресс
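A variant of the same idea, as a minimal sketch: the XPath from the question can be evaluated inside the page and the text node returned directly. The helper name text_after_label and the constant FOLLOWING_TEXT_JS are made up for illustration, and the sketch assumes the publisher name is the first text node after the element containing 'Издатель:', which has not been verified against the live page.

```python
# JavaScript run in the page: find the first text node that follows the
# element whose text is exactly the label, and return its content.
# Assumption: the label contains no double quote.
FOLLOWING_TEXT_JS = """
const label = arguments[0];
const xpath = `//*[text()="${label}"]/following-sibling::text()[1]`;
const node = document.evaluate(
    xpath, document, null,
    XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue;
return node ? node.textContent : null;
"""


def text_after_label(driver, label="Издатель:"):
    """Return the stripped text node that follows the label, or None.

    `driver` is assumed to be a Selenium WebDriver; any object with a
    compatible execute_script(script, *args) method works.
    """
    value = driver.execute_script(FOLLOWING_TEXT_JS, label)
    return value.strip() if value else None
```

With a live session this would be called as text_after_label(driver). It does no clicking or waiting on its own, so the WebDriverWait steps are still needed if the cell only appears after the section is expanded.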
Upvotes: 1
Reputation: 2177
You can achieve this with the JavaScript code below. Select every b
element, then get its parent element and read its innerText:
document.querySelectorAll('b').forEach(element => {
  console.log(element.parentElement.innerText);
});
<td width="300" class="highlight">
name 1
<td width="300" class="highlight">
name 2
<td width="300" class="highlight">
name 3
If there are other b
tags, you can check with an if statement whether the content of b
is the publisher label, like below:
document.querySelectorAll('b').forEach(element => {
  if (element.innerText == 'Publisher:') {
    console.log(element.parentElement.innerText);
  }
});
<td width="300" class="highlight">
name 1
<td width="300" class="highlight">
Date 1
<td width="300" class="highlight">
name 2
<td width="300" class="highlight">
name 3
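The same label check can be sketched in Python without a browser. This is not the BeautifulSoup code from the question: it swaps in the standard library's html.parser so the example runs without extra installs. The class LabelTextParser, the function values_after_label, and the sample markup (including the 'Year:' label) are hypothetical, written on the assumption that each label sits in a b element inside its td.

```python
from html.parser import HTMLParser


class LabelTextParser(HTMLParser):
    """Collect the text that follows <b>LABEL</b> inside the same <td>."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_b = False        # currently inside a <b> element
        self.b_text = ""         # text seen inside the current <b>
        self.capturing = False   # the last closed <b> matched the label
        self.current = []        # text pieces of the cell being captured
        self.values = []         # one entry per matching cell

    def _flush(self):
        # Store the captured text, if any, and reset the capture state.
        if self.capturing:
            self.values.append("".join(self.current).strip())
        self.capturing, self.current = False, []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._flush()
            self.in_b, self.b_text = True, ""
        elif tag == "td":
            self._flush()  # also covers a missing </td>

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_b = False
            self.capturing = self.b_text.strip() == self.label
            self.current = []
        elif tag == "td":
            self._flush()

    def handle_data(self, data):
        if self.in_b:
            self.b_text += data
        elif self.capturing:
            self.current.append(data)


def values_after_label(html, label):
    """Return the text following every <b>label</b> in the given HTML."""
    parser = LabelTextParser(label)
    parser.feed(html)
    parser.close()
    parser._flush()  # keep text captured at the very end of the input
    return parser.values


# Hypothetical markup modelled on the snippets above.
SAMPLE = """
<table><tr>
<td width="300" class="highlight"><b>Publisher:</b> name 1</td>
<td width="300" class="highlight"><b>Year:</b> Date 1</td>
<td width="300" class="highlight"><b>Publisher:</b> name 2</td>
</tr></table>
"""

print(values_after_label(SAMPLE, "Publisher:"))  # ['name 1', 'name 2']
```

Because the lookup keys on the label text rather than on the cell's position, it should keep working when the publisher row moves. With BeautifulSoup, the equivalent lookup should be along the lines of soup.find('b', string='Издатель:').next_sibling.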
Upvotes: 0