kashiB

Reputation: 207

How do I select text content that isn't wrapped in an HTML tag with XPath?

How do I capture TARGET from the following HTML sample with XPath and Puppeteer?

<div id="parent">
    <div id="sibling_1"> Hello </div>
    <div id="sibling_2"> Good </div>
    TARGET
    <div id="sibling_3"> Bye </div>
</div>

I can get Good and Bye with the following code, but I can't find a way to get TARGET.

const xpath = '//*[@id="sibling_1"]/following-sibling::*';
const elements = await page.$x(xpath);
// following-sibling::* matches element nodes only, so the bare TARGET text node is skipped
for (const element of elements) {
  const textContentHandle = await element.getProperty('textContent');
  const text = await textContentHandle.jsonValue();
  console.log("Text: ", text);
}

Upvotes: 0

Views: 56

Answers (3)

ggorlen

Reputation: 57344

If you don't need to use XPath in particular, a plain CSS selector with child node iteration works:

import puppeteer from "puppeteer"; // ^22.7.1

const html = `<div id="parent">
 <div id="sibling_1"> Hello </div>
 <div id="sibling_2"> Good </div>
 TARGET
 <div id="sibling_3"> Bye </div>
</div>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  const text = await page.$eval("#parent", el =>
    [...el.childNodes]
      .find(
        e =>
          e.textContent.trim() && e.nodeType === Node.TEXT_NODE
      )
      .textContent.trim()
  );
  console.log(text); // => TARGET
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

If you want all of the text nodes in cases where there are multiple:

[...el.childNodes]
  .filter(
    e =>
      e.textContent.trim() && e.nodeType === Node.TEXT_NODE
  )
  .map(e => e.textContent.trim())
  .join("") // optional, you may prefer an array

If your logic is that you want to select the next sibling after #sibling_2, then use:

const text = await page.$eval("#sibling_2", el =>
  el.nextSibling.textContent.trim()
);
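Note that nextSibling returns only the very next node, which could be a whitespace-only text node or another element if the markup changes. A slightly more defensive sketch walks forward until it finds a non-blank text node. The name nextTextAfter is hypothetical (it is not a Puppeteer API), and 3 is the value of Node.TEXT_NODE, hard-coded so the function has no outer dependencies:

```javascript
// Hypothetical helper (not part of Puppeteer): returns the trimmed content of
// the first non-blank text node among the siblings that follow `el`, or null.
function nextTextAfter(el) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM
  for (let node = el.nextSibling; node; node = node.nextSibling) {
    const text = node.textContent.trim();
    if (node.nodeType === TEXT_NODE && text) {
      return text;
    }
  }
  return null;
}
```

Because the function is self-contained, it should be possible to pass it directly as the callback, e.g. page.$eval("#sibling_2", nextTextAfter).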

Upvotes: 0

kashiB

Reputation: 207

It turns out TARGET is a text node that belongs directly to the parent element, so the parent's textContent includes it:

let xpath = '//*[@id="parent"]';
let elements = await page.$x(xpath);
let xpathTextContent = await elements[0].getProperty('textContent')
let text = await xpathTextContent.jsonValue();
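One caveat: textContent on the parent also returns the text of the child divs (Hello, Good, Bye). If only the bare text is wanted, XPath can address the text node itself with text(). A minimal sketch, assuming it runs in the page context where document is available (for example through page.evaluate) and that the id contains no double quotes; directTextByXPath is a hypothetical name:

```javascript
// Hypothetical helper: evaluates an XPath that selects the first non-blank
// text node that is a direct child of the element with the given id.
function directTextByXPath(id) {
  const STRING_TYPE = 2; // same value as XPathResult.STRING_TYPE
  // text()[normalize-space()] drops whitespace-only text nodes;
  // the outer normalize-space(...) trims the one that remains.
  const expr = `normalize-space(//*[@id="${id}"]/text()[normalize-space()])`;
  return document.evaluate(expr, document, null, STRING_TYPE, null).stringValue;
}

// In Puppeteer (assumed usage):
//   const text = await page.evaluate(directTextByXPath, "parent");
//   // => "TARGET" for the HTML in the question
```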

Upvotes: 0

supputuri

Reputation: 14145

Here is a solution in plain JavaScript (it returns all of the rendered text inside the parent, TARGET included):

document.querySelector('div#parent').innerText

Upvotes: 1
