Reputation: 3
I want to return an object from the HTML as below:
HTML
<div id="collection">
<div class="div">
<h1 class="title">Title 1</h1>
<ul class="list">
<li>list item 1</li>
<li>list item 2</li>
<li>list item 3</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 2</h1>
<ul class="list">
<li>list item 1a</li>
<li>list item 2a</li>
<li>list item 3a</li>
<li>list item 4a</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 3</h1>
</div>
</div>
Required result:
{
title: "Title 1",
list:{
item: "list item 1",
item: "list item 2",
item: "list item 3"
}
},
{
title: "Title 2",
list:{
item: "list item 1a",
item: "list item 2a",
item: "list item 3a",
item: "list item 4a"
}
},
{
title: "Title 3",
list:{}
}
So far I have:
const result = await page.$$eval('div.div', (divs) => divs.map((div) => {
  return {
    title: div.querySelector('.title').innerText,
  }
}));
console.log(result)
I am unsure how to use page.$$eval to then iterate over a nested element, in this case the ul. Any help would be appreciated.
Thanks
Upvotes: 0
Views: 2698
Reputation: 56855
Existing answers either don't work or promote rather brittle practices, like scraping "parallel" arrays and hoping the entries will line up.
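To illustrate the problem with parallel arrays, here is a minimal sketch in plain JavaScript. The hard-coded arrays are an assumption: they stand in for what two separate selector queries might return on a hypothetical page where the first section has no list.

```javascript
// Hypothetical page: "Title 1" has no <ul>, so only two lists are found.
const titles = ["Title 1", "Title 2", "Title 3"];
const lists = [["item 2a"], ["item 3a"]];

// Zipping by index silently shifts every list up one slot.
const zipped = titles.map((title, i) => ({title, list: lists[i]}));

console.log(zipped[0].list); // [ 'item 2a' ], which belongs to "Title 2"
console.log(zipped[2].list); // undefined
```

Querying the children relative to each parent element avoids this, because a missing list only affects its own entry.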
Your attempt is actually very close, and $$eval is the correct tool for the job. Here's how I'd do it:
const puppeteer = require("puppeteer"); // ^22.6.0
const html = `<div id="collection">
<div class="div">
<h1 class="title">Title 1</h1>
<ul class="list">
<li>list item 1</li>
<li>list item 2</li>
<li>list item 3</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 2</h1>
<ul class="list">
<li>list item 1a</li>
<li>list item 2a</li>
<li>list item 3a</li>
<li>list item 4a</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 3</h1>
</div>
</div>`;
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html, {waitUntil: "domcontentloaded"});
  const data = await page.$$eval(".div", els => els.map(el => ({
    title: el.querySelector(".title").textContent,
    list: [...el.querySelectorAll(".list li")].map(el => el.textContent),
  })));
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[
{
"title": "Title 1",
"list": [
"list item 1",
"list item 2",
"list item 3"
]
},
{
"title": "Title 2",
"list": [
"list item 1a",
"list item 2a",
"list item 3a",
"list item 4a"
]
},
{
"title": "Title 3",
"list": []
}
]
Note that my output differs a bit from what was requested and is, I think, a better fit: a list maps most naturally to an array, which is the standard ordered, iterable structure. In fact, keys in an object must be unique, so the data structure you've requested isn't actually possible to create.
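As a quick check of the unique-key point, this small sketch builds the requested shape as a plain object literal. It runs in Node without Puppeteer; the values are copied from the question.

```javascript
// Duplicate keys in an object literal are legal, but only the last one survives.
const list = {
  item: "list item 1",
  item: "list item 2",
  item: "list item 3",
};

console.log(Object.keys(list).length); // 1
console.log(list.item); // "list item 3"
```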
If you need random access or a keyed structure, you might want to do it by element text content, but then if two elements happen to have the same text content, you'll lose data.
To get an object, you can use something like:
list: Object.fromEntries([...el.querySelectorAll(".list li")]
  .map(el => [el.textContent, el.textContent])),
The output is a bit silly because the keys and values are the same, but that's about all the data that's available. If the elements have some sort of unique identifier property, you can use it for the keys by substituting it for the first element of the innermost array in my example code.
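Here is a rough sketch of both keying strategies, using plain objects in place of DOM elements so it runs without a browser. The `id` field is hypothetical (it might come from a data attribute on a real page) and is not present in the HTML from the question.

```javascript
// Plain objects stand in for <li> elements; `id` is a hypothetical unique
// identifier, not something found in the original HTML.
const items = [
  {id: "a1", text: "list item 1"},
  {id: "a2", text: "list item 2"},
  {id: "a3", text: "list item 1"}, // same text as the first item
];

// Keyed by text: the duplicate collapses into one entry.
const byText = Object.fromEntries(items.map(({text}) => [text, text]));

// Keyed by the unique identifier: every item survives.
const byId = Object.fromEntries(items.map(({id, text}) => [id, text]));

console.log(Object.keys(byText).length); // 2
console.log(Object.keys(byId).length); // 3
```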
But if you're in any doubt, just use the array solution provided at the top of the post, which is the way to go 99% of the time in this sort of situation.
As a final note, if the structure you're scraping is rendered after page load, you can use something like page.waitForSelector("#collection .list") to wait for it to show up. Other blockers are possible, like Cloudflare blocks, iframes and shadow DOMs, so if this doesn't work on your actual page, then there's likely some sort of confounding factor beyond the simple HTML shared here.
Upvotes: 2
Reputation: 2075
You can do it like this with Puppeteer. Just uncomment the page.evaluate() function. I only commented it out so that you can run the snippet and see the results.
You can't have multiple object entries with keys that are all the same, so a better solution might be to return an array with the li values, as the keys do not matter anyway, right?
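const result = []
// To run in Puppeteer, uncomment the wrapper below, move `result` inside the
// callback and return it: the callback runs in the browser and cannot see Node variables.
//await page.evaluate(() => {
  const divs = document.querySelectorAll('.div')
  divs.forEach(div => {
    const obj = {
      title: div.querySelector('.title').innerText,
      list: [...div.querySelectorAll('ul li')].map(i => i.innerText)
    }
    result.push(obj)
  })
//})
console.log(result)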
<div id="collection">
<div class="div">
<h1 class="title">Title 1</h1>
<ul class="list">
<li>list item 1</li>
<li>list item 2</li>
<li>list item 3</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 2</h1>
<ul class="list">
<li>list item 1a</li>
<li>list item 2a</li>
<li>list item 3a</li>
<li>list item 4a</li>
</ul>
</div>
<div class="div">
<h1 class="title">Title 3</h1>
</div>
</div>
Upvotes: 1
Reputation: 380
Try the following in your Puppeteer script; I think it might work:
const values = await page.evaluate(() => {
  const titles = Array.from(document.querySelectorAll('.title')).map(el => el.innerText); // this will get you an array with the titles
  const list = Array.from(document.querySelectorAll('.list')).map(el => Array.from(el.children).map(elm => elm.innerText));
  const endArray = titles.map((el, index) => {
    return {
      title: el,
      list: list[index],
    }
  })
  return endArray;
});
You cannot have an object with repeated keys, as you're trying to do. It's better practice to define your list as an array, since it holds repeated data of the same type.
Upvotes: 1