Sanjeet Kunwar

Reputation: 103

How to scrape image src URLs using Node.js and Puppeteer

I want to scrape an image from a Wikipedia page, but I am getting three URLs for the same image at once, and all three are in the same img tag. I just want the src URL. Does anybody know how to do it?

const puppeteer = require('puppeteer');
const sleep = require('sleep');

(async () => {
    const browser = await puppeteer.launch({
        headless: false
    });
    const page = await browser.newPage();

    await page.goto("https://www.wikipedia.org/");

    // Find the "Commons" link on the portal page and click it.
    const xpathselector = `//span[contains(text(), "Commons")]`;
    const commonlinks = await page.waitForXPath(xpathselector);
    await page.waitFor(3000);
    await commonlinks.click();
    await page.waitFor(2000);

    //await page.waitForSelector()

    const images = await page.$eval('a[class="image"] > img[src]', node => node.innerHTML);
    console.log(images);
})();

//*[@id="mainpage-potd"]/div[1]/a/img

Upvotes: 2

Views: 3534

Answers (1)

hardkoded

Reputation: 21695

I bet that you "see" three URLs because you are looking at the srcset, which has many URLs for different screen resolutions. You could return the src property instead:

const images = await page.$eval(('a[class="image"] > img[src]'),node => node.src);
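
To see why one img tag can appear to carry several URLs, here is a minimal sketch of how a srcset value breaks down into its candidates. The helper name parseSrcset and the sample URLs are hypothetical, not part of Puppeteer or of the page in the question. The parsing is also simplified: it assumes candidates are separated by a comma followed by whitespace.

```javascript
// Hypothetical helper: split a srcset attribute value into candidate URLs.
// Simplified parsing: assumes candidates are separated by ", " (comma plus
// whitespace), so commas inside a URL itself are left alone.
function parseSrcset(srcset) {
  return srcset
    .trim()
    .split(/,\s+/)
    .filter(part => part.length > 0)
    .map(part => {
      // Each candidate is "<url> <descriptor>"; the descriptor (e.g. "2x")
      // is optional and defaults to "1x".
      const [url, descriptor = '1x'] = part.trim().split(/\s+/);
      return { url, descriptor };
    });
}

// Made-up srcset value in the style of a thumbnail with two density variants.
const sample =
  '//upload.example.org/thumb/450px-Example.jpg 1.5x, ' +
  '//upload.example.org/thumb/600px-Example.jpg 2x';

console.log(parseSrcset(sample));
```

The src attribute holds the single default URL, while srcset lists the alternatives, so src plus two srcset candidates reads as "three URLs". If the alternatives are also wanted, the $eval callback could return node.getAttribute('srcset') alongside node.src and pass the string through a helper like this one.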

Upvotes: 4
