Everett De Leon

Reputation: 15

How to scrape image src the right way using puppeteer?

I'm trying to create a function that captures the src attribute of images on a website, but none of the most common ways of doing so are working.

This was my original attempt.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.setDefaultNavigationTimeout(0);
    await page.waitForTimeout(500);

    await page.goto(
      `https://www.sirved.com/restaurant/essex-ontario-canada/dairy-freez/1/menus/3413654`,
      {
        waitUntil: "domcontentloaded",
      }
    );

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(
        "#menus > div.tab-content >div > div > div.swiper-wrapper > div.swiper-slide > img"
      );
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);
  } catch (err) {
    console.log(err);
  }

  await browser.close();
})();

Output:

[]

In my next attempt, I tried a suggestion and got back an empty string.

    await page.setViewport({ width: 1024, height: 768 });
    
    const imgs = await page.$$eval("#menus img", (images) =>
      images.map((i) => i.src)
    );

    console.log(imgs);

In my final attempt, I followed another suggestion and got back an array containing two empty strings.

    const fetchImgSrc = await page.evaluate(() => {
      const img = document.querySelectorAll(".swiper-lazy-loaded");
      let src = [];
      for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
      }
      return src;
    });

    console.log(fetchImgSrc);

In each attempt I only replaced the function and the console.log portion of the code. I've done a lot of digging and found that these are the most common ways of scraping an image src with Puppeteer. I've used them successfully elsewhere, but here they aren't working. I'm not sure whether there is a bug in my code or some other reason it fails.

Upvotes: 1

Views: 1695

Answers (2)

theDavidBarton

Reputation: 8851

You have two issues here:

  1. By default, Puppeteer opens the page in a small viewport, and the images you want to scrape are lazy loaded: while they are outside the viewport they are not loaded and do not even have a src yet. Set a larger viewport with page.setViewport.
  2. Element.getAttribute is not advised when you are working with dynamically changing websites: it returns the original attribute value, which is an empty string for a lazy-loaded image. What you need is the src property, which is always up to date in the DOM. This is the attribute-versus-property distinction in JavaScript.

By the way, you can shorten your script with page.$$eval like this:

await page.setViewport({ width: 1024, height: 768 })
const imgs = await page.$$eval('#menus img', images => images.map(i => i.src))
console.log(imgs)

Output:

[
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg',
  'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg'
]
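
For completeness, here is a minimal sketch of how the two fixes above might fit into a full script. The function names `readSrcProperties` and `scrapeMenuImages` are hypothetical, and the `networkidle2` wait and the filtering of still-empty entries are my own assumptions, not part of the original script (which used `domcontentloaded`).

```javascript
// Sketch only: combines the larger viewport and the src property from this answer.
// Assumes puppeteer is installed (npm install puppeteer); function names are hypothetical.

// Runs in the page context via $$eval. Reads the live src property of each image
// and drops entries that are still empty (for example, not lazy loaded yet).
function readSrcProperties(images) {
  return images.map((img) => img.src).filter((src) => src !== "");
}

async function scrapeMenuImages(url) {
  // Required lazily so the helper above can be exercised without launching a browser.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // A larger viewport so the lazy-loaded images fall inside the visible area.
    await page.setViewport({ width: 1024, height: 768 });
    // Assumption: waiting for the network to settle gives the lazy loader time to run.
    await page.goto(url, { waitUntil: "networkidle2" });
    return await page.$$eval("#menus img", readSrcProperties);
  } finally {
    // Close the browser even if navigation or evaluation throws.
    await browser.close();
  }
}
```

You would call it as `scrapeMenuImages(url).then(console.log)` with the menu URL from the question. The function passed to `$$eval` is serialized and executed inside the page, so it must not reference variables from the surrounding Node.js scope.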

Upvotes: 1

Himanshu Poddar

Reputation: 7789

To get the src links of the two menu images on this page, you can use:

const fetchImgSrc = await page.evaluate(() => {
    const img = document.querySelectorAll('.swiper-lazy-loaded');
    let src = [];
    for (let i = 0; i < img.length; i++) {
        src.push(img[i].getAttribute("src"));
    }
    return src;
});

This gives us the expected output:

['https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3b9eabc40.jpg', 'https://images.sirved.com/ChIJq6qqqlrZOogRs_xGxBcn0_w/5caf3bbe93cc6.jpg']
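
If this snippet returns empty strings, as reported in the question, the lazy loader may simply not have filled in the src yet at the time of the read. One possible workaround, sketched below, is to wait until every matched image has a non-empty src before collecting the values. The names `lazyImagesReady` and `readLazySrcs` and the 10-second timeout are assumptions for illustration; `page` is an already-navigated Puppeteer page.

```javascript
// Sketch only: waits for lazy-loaded images to receive a src before reading it.
// Function names and the timeout value are hypothetical.

// Page-context predicate: true once at least one element matches the selector
// and every matched element has a non-empty src attribute.
function lazyImagesReady(selector) {
  const imgs = Array.from(document.querySelectorAll(selector));
  return (
    imgs.length > 0 &&
    imgs.every((img) => (img.getAttribute("src") || "") !== "")
  );
}

// `page` is an existing Puppeteer Page that has already navigated to the target URL.
async function readLazySrcs(page, selector = ".swiper-lazy-loaded") {
  // Re-evaluates the predicate inside the page until it returns true or the timeout expires.
  await page.waitForFunction(lazyImagesReady, { timeout: 10000 }, selector);
  return page.$$eval(selector, (imgs) =>
    imgs.map((img) => img.getAttribute("src"))
  );
}
```

If the images never load (for instance because they stay outside the viewport), `waitForFunction` rejects with a timeout error instead of silently returning empty strings, which makes the failure easier to diagnose.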

Upvotes: 1
