kapilgohil

Reputation: 3

Puppeteer iterate div and then from result, iterate child element

I want to return an object from the HTML as below:

HTML

<div id="collection">
  <div class="div">
    <h1 class="title">Title 1</h1>
    <ul class="list">
      <li>list item 1</li>
      <li>list item 2</li>
      <li>list item 3</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 2</h1>
    <ul class="list">
      <li>list item 1a</li>
      <li>list item 2a</li>
      <li>list item 3a</li>
      <li>list item 4a</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 3</h1>
  </div>
</div>

Required result:

{
  title: "Title 1",
  list:{
    item: "list item 1",
    item: "list item 2",
    item: "list item 3"
  }
},
{
  title: "Title 2",
  list:{
    item: "list item 1a",
    item: "list item 2a",
    item: "list item 3a",
    item: "list item 4a"
  }
},
{
  title: "Title 3",
  list:{}
}

So far I have:

const result = await page.$$eval('div.div', (divs) => divs.map((div) => {
   return {
      title: div.querySelector('.title').innerText,
   }
}));
console.log(result)

I am unsure how to use page.$$eval to then iterate over another element inside each div; in this case, the ul. Any help would be appreciated.

Thanks

Upvotes: 0

Views: 2698

Answers (3)

ggorlen

Reputation: 56855

Existing answers either don't work or promote rather brittle practices, like scraping "parallel" arrays and hoping the entries will line up.
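
To illustrate why the "parallel" arrays approach is brittle, here is a minimal sketch in plain Node, with no Puppeteer involved. The sections array is made-up data standing in for the DOM, and it assumes the section without a list sits in the middle rather than at the end:

```javascript
// Made-up data standing in for the DOM: the middle section has no list.
const sections = [
  {title: "Title 1", list: ["list item 1"]},
  {title: "Title 2", list: null}, // no <ul> in this section
  {title: "Title 3", list: ["list item 3"]},
];

// "Parallel arrays": collect titles and lists separately, then zip by index.
const titles = sections.map(s => s.title);
const lists = sections.filter(s => s.list).map(s => s.list);
const zipped = titles.map((title, i) => ({title, list: lists[i]}));

// Per-section: read both fields from the same parent, so nothing can shift.
const perSection = sections.map(s => ({title: s.title, list: s.list ?? []}));

console.log(zipped[1]); // Title 2 is paired with Title 3's list
console.log(perSection[1]); // Title 2 keeps its own (empty) list
```

Because the zipped version only sees two lists, every title after the gap is paired with the wrong one, and the last title gets undefined.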

Your attempt is actually very close, and $$eval is the correct tool for the job. Here's how I'd do it:

const puppeteer = require("puppeteer"); // ^22.6.0

const html = `<div id="collection">
  <div class="div">
    <h1 class="title">Title 1</h1>
    <ul class="list">
      <li>list item 1</li>
      <li>list item 2</li>
      <li>list item 3</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 2</h1>
    <ul class="list">
      <li>list item 1a</li>
      <li>list item 2a</li>
      <li>list item 3a</li>
      <li>list item 4a</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 3</h1>
  </div>
</div>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html, {waitUntil: "domcontentloaded"});
  const data = await page.$$eval(".div", els => els.map(el => ({
    title: el.querySelector(".title").textContent,
    list: [...el.querySelectorAll(".list li")].map(el => el.textContent),
  })));
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

[
  {
    "title": "Title 1",
    "list": [
      "list item 1",
      "list item 2",
      "list item 3"
    ]
  },
  {
    "title": "Title 2",
    "list": [
      "list item 1a",
      "list item 2a",
      "list item 3a",
      "list item 4a"
    ]
  },
  {
    "title": "Title 3",
    "list": []
  }
]

Note that my output is a bit different from what you requested and, I think, a better fit: arrays are the most natural ordered, iterable structure for list data. In fact, the keys of an object must be unique, so the data structure you've requested isn't actually possible to create.
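
As a quick check of the unique-keys point, this small sketch can be run in Node. It uses the first list from the question:

```javascript
// Duplicate keys in an object literal are legal syntax, but the last one wins.
const list = {
  item: "list item 1",
  item: "list item 2",
  item: "list item 3",
};

console.log(Object.keys(list)); // only a single "item" key survives
console.log(list.item); // the last value assigned to that key
```

The first two values are silently overwritten, which is why the requested shape can't hold all of the list items.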

If you need random access or a keyed structure, you could key the object by element text content, but if two elements happen to have the same text content, you'll lose data.

To get an object, you can use something like:

list: Object.fromEntries([...el.querySelectorAll(".list li")]
  .map(el => [el.textContent, el.textContent])),

The output is a bit silly because the keys and values are the same, but that's about all the data that's available. If the elements have some sort of unique identifier property that you can use for the keys, substitute it for the first element of the innermost array in my example code.
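
If no such identifier exists, one option is to build keys from each item's position. This is only a sketch: keyByPosition and the "item" prefix are hypothetical names I made up, not part of Puppeteer. Inside a $$eval callback you'd need to inline the same logic, since the callback runs in the browser and can't see helpers defined in Node.

```javascript
// Hypothetical helper: turns an array of texts into {item1: ..., item2: ...}.
// Position-based keys are always unique, so duplicate texts are preserved.
function keyByPosition(texts, prefix = "item") {
  return Object.fromEntries(
    texts.map((text, i) => [`${prefix}${i + 1}`, text])
  );
}

const keyed = keyByPosition(["list item 1", "list item 1", "list item 3"]);
console.log(keyed);
```

That said, position-based keys carry no more information than array indexes, so the array is usually still simpler.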

But if you're in any doubt, just use the array solution provided at the top of the post, which is the way to go 99% of the time in this sort of situation.

As a final note, if the structure you're scraping is rendered after page load, you can use something like page.waitForSelector("#collection .list") to wait for it to show up. Other blockers are possible, like Cloudflare blocks, iframes and shadow DOMs, so if this doesn't work on your actual page, there's likely some confounding factor beyond the simple HTML shared here.
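
A rough sketch of how the wait and the extraction might fit together is below. The function names, the 5000 ms timeout, the trim() calls and the null fallback for a missing title are my assumptions, not requirements; page is expected to be a Puppeteer Page.

```javascript
// Runs in the browser via $$eval, so it must not reference outer variables.
function extractSections(els) {
  return els.map(el => ({
    // Optional chaining guards against a section that lacks a .title element.
    title: el.querySelector(".title")?.textContent.trim() ?? null,
    list: [...el.querySelectorAll(".list li")].map(li => li.textContent.trim()),
  }));
}

// `page` is a Puppeteer Page; the default timeout here is an arbitrary choice.
async function scrapeCollection(page, timeout = 5000) {
  // Rejects with a timeout error if no list shows up in time.
  await page.waitForSelector("#collection .list", {timeout});
  return page.$$eval("#collection .div", extractSections);
}
```

Usage would be something like const data = await scrapeCollection(page) after navigating. Keeping the extraction in its own function also makes it easy to exercise with stand-in objects, without launching a browser.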

Upvotes: 2

Ludolfyn

Reputation: 2075

You can do it like this with Puppeteer. I commented out the page.evaluate() wrapper lines so that you can run the snippet and see the results; uncomment them in your Puppeteer script.

You can't have multiple object entries with keys that are all the same, so a better solution might be to return an array with the li values, as the keys do not matter anyway, right?

//const result = await page.evaluate(() => {
  // Declared inside the callback: the browser context can't see Node variables
  const result = []
  const divs = document.querySelectorAll('.div')
  divs.forEach(div => {
    const obj = {
      title: div.querySelector('.title').innerText,
      list: [...div.querySelectorAll('ul li')].map(i => i.innerText)
    }
    result.push(obj)
  })
//  return result
//})

console.log(result)

HTML

<div id="collection">
  <div class="div">
    <h1 class="title">Title 1</h1>
    <ul class="list">
      <li>list item 1</li>
      <li>list item 2</li>
      <li>list item 3</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 2</h1>
    <ul class="list">
      <li>list item 1a</li>
      <li>list item 2a</li>
      <li>list item 3a</li>
      <li>list item 4a</li>
    </ul>
  </div>
  <div class="div">
    <h1 class="title">Title 3</h1>
  </div>
</div>

Upvotes: 1

innis

Reputation: 380

Try the following in your Puppeteer script; I think it should work:

const values = await page.evaluate(() => {
    const titles = Array.from(document.querySelectorAll('.title')).map(el => el.innerText); // this will get you an array with the titles  
    const list = Array.from(document.querySelectorAll('.list')).map(el => Array.from(el.children).map(elm => elm.innerText));

    const endArray = titles.map((el, index) => {
      return {
        title: el,
        list: list[index],
      }
    })

    return endArray;
});

You cannot have an object with repeated attributes as you're trying to do. It's a better practice to define your list as an array, since you know your list has a repeated type of data.

Upvotes: 1
