Reputation: 63
I'm using Puppeteer with jQuery-style selectors (via Cheerio) and Node.js to try to get list items from a web page:
<table>
  <td class="hr">
    <ul class="people">
      <li class = "person">Richard</li>
      <li class = "person">Linus</li>
      <li class = "person">Brian</li>
      <li class = "team_lead">Charles</li>
    </ul>
  </td>
  <td class="manufacturing">
    <ul class="people">
      <li class = "person">Alan</ul>
      <li class = "person">Margret</li>
      <li class = "person">Ken</li>
      <li class = "person">Edsger</li>
      <li class = "team_lead">Dennis</li>
    </ul>
  </td>
  <td class="design">
    <ul class="people">
      <li class = "person">Bill</li>
      <li class = "person">Ada</li>
      <li class = "person">Steve</li>
      <li class = "person">Ken</li>
      <li class = "team_lead">Dennis</li>
    </ul>
  </td>
</table>
and this Node.js code:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
async function main(){
  const browser = await puppeteer.launch({headless : false, defaultViewport: {width: 1920, height: 1080}});
  const page = await browser.newPage();
  await page.goto("${url}");
  const htmlContent = await page.content();
  const $ = cheerio.load(htmlContent);
  let peopleList = [];
  $(`table td .people`).each(function(i, li){
    peopleList.push(li.text());
  });
  console.log(`people: ${peopleList}`);
}
main();
I adapted this list-parsing code from another Stack Overflow answer, "How to store list items within an array with jQuery", and from a Udemy tutorial.
I want to store the names in a two-dimensional array, one inner array per list, something like:
peopleList = [[Richard, Linus, Brian, Charles], [Alan, Margret, Ken, Edsger, Dennis], [Bill, Ada, Steve, Ken, Dennis]];
However, I am getting a single string:
RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,RichardLinusBrianCharlesAlanMargretEdsgerDenisBillAdSteveKenDennis,...
(repeated for each ul element), and when I try to go deeper and include the li tags in the selector, I just get an empty string.
Upvotes: 1
Views: 233
Reputation: 56855
There is no need to use Cheerio with Puppeteer. Puppeteer already works with the live page, so it generally doesn't make sense to snapshot the page into a string, then dump it into a separate library. This is inefficient and leads to confusing bugs when the snapshot goes stale.
Instead, use page.$$eval(yourSelector, browserCallback) to do the job:
const puppeteer = require("puppeteer"); // ^21.6.0
const html = `<HTML pasted from your question>`;
let browser;
(async () => {
  browser = await puppeteer.launch({headless: "new"});
  const [page] = await browser.pages();
  await page.setContent(html);
  const sel = "table td .people .person";
  await page.waitForSelector(sel);
  const people = await page.$$eval(
    sel,
    els => els.map(el => el.textContent.trim())
  );
  console.log(people);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[
  'Richard', 'Linus',
  'Brian',   'Alan',
  'Bill',    'Ada',
  'Steve',   'Ken'
]
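The joined-string issue is resolved above by using the selector table td .people .person, which would technically work in the Cheerio approach as well. Team leads are excluded because they use the team_lead class rather than person. Margret, Edsger and the manufacturing Ken are also missing because the HTML in the question closes the manufacturing list early with a stray </ul> after Alan, so those items fall outside .people (the Ken shown comes from the design list).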
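For reference, here is a minimal sketch of what that fix might look like if you stayed with Cheerio. The helper name listPeople is hypothetical, and `$` is assumed to be the object returned by cheerio.load(htmlContent) in your code. In Cheerio, the .each callback receives a raw node rather than a wrapped selection, so the node has to be wrapped with $() before calling .text():

```javascript
// Hypothetical helper: `$` is assumed to be the result of cheerio.load(htmlContent).
function listPeople($) {
  return $("table td .people .person")
    .toArray()                        // plain array of the matched nodes
    .map(li => $(li).text().trim());  // wrap each raw node before calling .text()
}

// Usage in the question's code (assumes cheerio is installed):
//   const $ = cheerio.load(htmlContent);
//   console.log(listPeople($));
```

Using toArray() followed by Array.prototype.map avoids the index-first callback signature of Cheerio's own .each and .map.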
If you want to keep the categories distinct, you could use a nested query:
// ...
const people = await page.$$eval("table td", els =>
  els.map(el => ({
    category: el.className,
    people: [...el.querySelectorAll(".person")].map(e =>
      e.textContent.trim()
    ),
  }))
);
// ...
which gives:
[
  { category: 'hr', people: [ 'Richard', 'Linus', 'Brian' ] },
  {
    category: 'manufacturing',
    people: [ 'Alan', 'Margret', 'Ken', 'Edsger' ]
  },
  { category: 'design', people: [ 'Bill', 'Ada', 'Steve', 'Ken' ] }
]
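If you'd rather have the plain two-dimensional array described in the question, team leads included, one possible variation is to select every li in each cell instead of only .person. This is a sketch under assumptions: groupNames and scrapeGroups are hypothetical names, and page is assumed to be a Puppeteer page that already has the HTML loaded.

```javascript
// Runs in the browser: builds one inner array of names per table cell.
// It must not reference outer variables, because Puppeteer serializes it.
const groupNames = cells =>
  cells.map(cell =>
    [...cell.querySelectorAll("li")].map(li => li.textContent.trim())
  );

// Hypothetical wrapper; `page` is assumed to be a loaded Puppeteer page.
async function scrapeGroups(page) {
  return page.$$eval("table td", groupNames);
}

// Usage, inside the async function from the example above:
//   const peopleList = await scrapeGroups(page);
```

Keeping the browser callback as a standalone function makes it easy to reuse with other cell selectors.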
All that said, if the page you're working with has the data you want statically, using fetch and Cheerio may make sense. But I'm assuming you're working with a SPA or website that requires some interaction to get to the scrape point, or there's some other good motivator for using Puppeteer.
As another aside, if you wind up sticking with Puppeteer but prefer to use jQuery, you can either add it, or use it if the page happens to have jQuery included already. You'll then access $ inside an evaluate-family callback that runs in the browser context. This makes more sense than using Cheerio in most cases, since you're taking advantage of the realtime page abilities of Puppeteer and won't suffer from stale data issues.
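A minimal sketch of the jQuery route might look like the following. The helper name namesWithJQuery and the CDN URL are assumptions for illustration, and page is assumed to be a Puppeteer page that has already loaded the document. The code uses the jQuery global, for which $ is the usual alias.

```javascript
// Hypothetical helper; `page` is assumed to be a loaded Puppeteer page.
async function namesWithJQuery(page) {
  // Inject jQuery only if the page doesn't already provide it.
  const hasJQuery = await page.evaluate(
    () => typeof window.jQuery === "function"
  );
  if (!hasJQuery) {
    // Assumed CDN URL; substitute whichever jQuery build you prefer.
    await page.addScriptTag({url: "https://code.jquery.com/jquery-3.7.1.min.js"});
  }
  // This callback runs in the browser, where jQuery is now available.
  return page.evaluate(() =>
    jQuery("table td .people .person")
      .map((i, el) => jQuery(el).text().trim())
      .get() // convert the jQuery collection to a plain array
  );
}
```

The result comes back to Node as a plain array of strings, since evaluate can only return serializable values.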
To answer your other question, for demo and reproducibility purposes, I use setContent as shown above, but you can run a server and navigate to your page on localhost. Just make sure to include the port.
Disclosure: I'm the author of the linked blog post.
Upvotes: 2