Reputation: 3572
I have the following HTML:
<article data-tid="product-detail"> <!-- I page.$eval on this element. -->
<h1 itemprop="name">Product name</h1> <!-- This I can query by itemprop -->
<h2>Some other topic</h2>
<p>I don't want this text.</p>
<h2>Unique topic</h2> <!-- I can find this innerText === "Unique topic". -->
<p>Text I want real' bad.</p> <!-- I want this innerText. -->
<h2>Some other topic</h2>
<p>I don't want this text.</p>
</article>
How do I get the text "Text I want real' bad." from the page, knowing only the heading "Unique topic"?
I run Puppeteer from a Node script.
This is what I have so far:
async function puppeteerProductDataExtractor($product) {
// This works like a charm.
const productName = $product.querySelector('[itemprop=name]')?.innerText;
// Now I have to find the right h2 and get its next element sibling.
const h2 = $product.querySelectorAll('h2');
// 1. If I try to get innerText of all h2, it fails with getProperty function not defined.
console.log(
await Promise.all([...h2].map(async $el => await (await element.getProperty('innerText')).jsonValue()))
);
// 2. This returns empty array
console.log([...h2].filter($el => $el.innerText.startsWith('Unique topic')));
// This prints JSHandle@array - innerText is not a string.
console.log([...h2].map($el => $el.innerText));
// This also prints JSHandle@array which is just insane.
console.log([...h2].map($el => Object.keys($el)));
// This fails with "property is not a function" error.
console.log([...h2].map(el => el.property('innerText')));
// So does this.
console.log([...h2].map(el => el.getProperty('innerText')));
}
page.on('console', consoleObj => console.log('xxxx', consoleObj.text()));
const product = await page.$eval('article[data-tid=product-detail]', puppeteerProductDataExtractor);
The first attempt comes from here: https://stackoverflow.com/a/52828950/336753
Everything else is just frustrated blind shooting. I must admit I'm very confused. Some of this should work according to the docs, but it just fails. For example, JSHandle should have the property function, but calling it fails (not a function). I didn't even get to the nextSibling part.
I tried a lot of code, most of which failed, and I don't want to pollute the question with it. It feels like this should be really simple and I'm just missing something. I hope the original intention is clear.
I'm sure there is a simple solution, but trial and error doesn't seem to be the way to get to it.
After more digging, it turns out my original approach was correct. The filter() didn't fail because innerText was an instance of JSHandle (as it appeared), but because the uppercase first letter was produced by CSS (I had to lowercase the string before comparing). Kinda ashamed here... Sorry, and thanks to @ggorlen for the assistance.
/* WE'RE INSIDE $eval FUNCTION */
// This returns JSHandle@array which is just weird...
console.log([...h2].map(el => el.innerText));
// But this returns the joined string correctly. Huh...
console.log([...h2].map(el => el.innerText).join(';'));
// So this eventually works
console.log([...h2]
.filter(el => el.textContent.toLowerCase().startsWith('unique topic'))[0]
?.nextElementSibling.textContent);
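The working version above can be wrapped in a small helper. This is only a sketch: the name `textAfterHeading` and its `topic` parameter are made up for illustration, and it assumes the heading text is unique within the article. As far as I understand, `innerText` reflects CSS `text-transform` while `textContent` returns the raw source text, so the sketch compares a trimmed, lowercased `textContent`.

```javascript
// Runs in the browser, e.g. as the callback passed to page.$eval.
// `article` is the matched element; `topic` is the heading text to look for
// (both names are hypothetical, chosen for this example).
function textAfterHeading(article, topic) {
  const wanted = topic.trim().toLowerCase();
  const heading = [...article.querySelectorAll("h2")].find(
    h2 => h2.textContent.trim().toLowerCase().startsWith(wanted)
  );
  // nextElementSibling skips whitespace text nodes between the tags.
  // Returns null when the heading or its sibling is missing.
  return heading?.nextElementSibling?.textContent.trim() ?? null;
}
```

From Node this could be called as `await page.$eval('article[data-tid=product-detail]', textAfterHeading, 'Unique topic')`. The function is serialized and sent to the page, so it must not reference variables from the Node scope.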
Upvotes: 1
Views: 960
Reputation: 57145
You could try nextElementSibling rather than nextSibling, which can return a whitespace text node:
const text = document
.querySelector("h2")
.nextElementSibling
.textContent
;
console.log(text);
<h2>Unique topic</h2>
<p>Text I want real' bad.</p>
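To make the difference concrete, here is a rough sketch of what `nextElementSibling` does for you: it keeps following `nextSibling` until it reaches an element node (`nodeType === 1`). The helper name `nextElement` is made up for illustration; in real code, just use `nextElementSibling`.

```javascript
// Hand-rolled equivalent of node.nextElementSibling, for illustration only.
function nextElement(node) {
  let sibling = node.nextSibling;
  // Skip text nodes (nodeType 3), comments (nodeType 8), and so on
  // until an element (nodeType 1) appears or the siblings run out.
  while (sibling && sibling.nodeType !== 1) {
    sibling = sibling.nextSibling;
  }
  return sibling ?? null;
}
```

With the markup above, `h2.nextSibling` is the newline between `</h2>` and `<p>`, whereas `nextElement(h2)` and `h2.nextElementSibling` both give the `<p>`.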
While you haven't shared your markup, it looks like you might have multiple of these <h2>/<p> combos and you want to extract the text from each. This might help get you started:
const text = [...document.querySelectorAll("h2")]
.map(e => e.nextElementSibling.textContent)
;
console.log(text);
<h2>Unique topic 1</h2>
<p>Text I want real' bad. 1</p>
<h2>Unique topic 2</h2>
<p>Text I want real' bad. 2</p>
<h2>Unique topic 3</h2>
<p>Text I want real' bad. 3</p>
If it's not obvious, the above code must run inside an $eval, $$eval or evaluate. For example, you could use:
const puppeteer = require("puppeteer");
let browser;
(async () => {
const html = `
<h2>Unique topic 1</h2>
<p>Text I want real' bad. 1</p>
<h2>Unique topic 2</h2>
<p>Text I want real' bad. 2</p>
<h2>Unique topic 3</h2>
<p>Text I want real' bad. 3</p>
`;
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const contents = await page.$$eval(
"h2",
els => els.map(e => e.nextElementSibling.textContent)
);
console.log(contents);
})()
.catch(err => console.error(err))
.finally(() => browser?.close())
;
Your line
await (await element.getProperty('innerText')).jsonValue()
is pure Node Puppeteer, yet you're attempting to run it inside the browser console. That's a common mistake: ElementHandles work only in Puppeteer. The thread you linked offers a bottom example that shows the evaluate approach, which uses browser-only code.
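For contrast, here is a sketch of the ElementHandle style running where it belongs, in Node rather than in the page. It assumes `page` is an already-open Puppeteer page; the function name `headingTextsViaHandles` is made up for this example.

```javascript
// Runs in Node. `page` is assumed to be an open Puppeteer Page.
async function headingTextsViaHandles(page) {
  // page.$$ returns ElementHandles, which live on the Node side.
  const handles = await page.$$("h2");
  return Promise.all(
    handles.map(async handle => {
      // getProperty/jsonValue exist on the handle, not on the DOM element.
      const prop = await handle.getProperty("textContent");
      return prop.jsonValue();
    })
  );
}
```

Inside an `$eval` callback the same elements are plain DOM nodes, so `el.textContent` is all you need there, and `getProperty` is undefined.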
For debugging browser code (anything executed in a callback to evaluate, $eval, $$eval, etc.), I recommend attaching a listener to the console as shown in "How do print the console output of the page in puppeter as it would appear in the browser?", or running headfully so you can see the error message.
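A minimal sketch of such a listener follows, assuming `page` is an open Puppeteer page. The helper name `forwardPageLogs` and its injectable `log` parameter are my own additions for illustration.

```javascript
// Runs in Node. `page` is assumed to be an open Puppeteer Page;
// `log` defaults to console.log and can be swapped out if needed.
function forwardPageLogs(page, log = console.log) {
  // Messages logged by code running inside evaluate/$eval/$$eval callbacks.
  page.on("console", msg => log(`[page ${msg.type()}] ${msg.text()}`));
  // Uncaught exceptions thrown in the page.
  page.on("pageerror", err => log(`[page error] ${err.message}`));
}
```

Note that `msg.text()` renders non-primitive arguments as placeholders such as `JSHandle@array`, which is likely where that output in your attempts came from. Logging a joined string or `JSON.stringify(...)` of the value inside the page avoids it.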
Another tip is to work out your selectors in the browser by hand, then add them to Puppeteer's evaluate only after you have them working. evaluate is general, so all DOM manipulation you can do with shorthand Puppeteer page and elementHandle convenience methods like .click(), .getProperty(), .$eval, .$x, etc. can be done directly in evaluate.
Upvotes: 2