DmnkVD

Reputation: 75

Puppeteer: Scrape sometimes works, sometimes fails with TypeError

As a personal challenge, I am trying to create a tool that scrapes the search results of a website (the shopping platform Alibaba, in this experiment) using Puppeteer and saves the output into a JSON object that can later be used to create a visualisation on the front-end.

My first step has been to access the first page of the search results, and scrape the listings from there into an array:

const puppeteer = require('puppeteer');
const fs = require('fs');

/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`

/* keyword to search for */
const keyword = `future`;

(async () => {
    try {
        const browser = await puppeteer.launch({
            headless: true
        });

        const page = await browser.newPage();

        await page.goto(url(keyword), {
            waitUntil: 'networkidle2'
        });

        await page.waitForSelector('.m-gallery-product-item-v2');

        let urls = await page.evaluate(() => {
            let results = [];
            let items = document.querySelectorAll('.m-gallery-product-item-v2');

            // This console.log never gets printed to either the browser window or the terminal?
            console.log(items)

            items.forEach( item => {
                let CurrentTime = Date.now();
                let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");
                let link = item.querySelector('.organic-list-offer__img-section').getAttribute("href");
                let img = item.querySelector('.seb-img-switcher__imgs').getAttribute("data-image");

                results.push({
                    'scrapeTime': CurrentTime,
                    'title': title,
                    'link': `https:${link}`,
                    'img': `https:${img}`,
                })
            });
            return results;
            
        })
        console.log(urls)
        browser.close();

    } catch (e) {
        console.log(e);
        browser.close();
    }
})();

When I run the file (test-2.js) in the terminal using Node, it sometimes returns the results array just fine, and at other times throws an error. The error thrown in the terminal, roughly half the time, is:

Error: Evaluation failed: TypeError: Cannot read property 'getAttribute' of null
    at __puppeteer_evaluation_script__:11:82
    at NodeList.forEach (<anonymous>)
    at __puppeteer_evaluation_script__:8:19
    at ExecutionContext._evaluateInternal (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:102:19)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async ExecutionContext.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:33:16)
    at async /Users/dmnk/scraper/test-2.js:24:20
  -- ASYNC --
    at ExecutionContext.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
    at DOMWorld.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/DOMWorld.js:89:24)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
  -- ASYNC --
    at Frame.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
    at Page.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/Page.js:612:14)
    at Page.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:95:27)
    at /Users/dmnk/scraper/test-2.js:24:31
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: ReferenceError: browser is not defined
    at /Users/dmnk/scraper/test-2.js:52:9
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:53159) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

I am relatively new to asynchronous JavaScript and still getting to grips with it.

I have been trying for days to understand why this error happens, to no avail. I would be extremely thankful for any help understanding the cause and troubleshooting it.

Upvotes: 2

Views: 2616

Answers (2)

theDavidBarton

Reputation: 8841

You are indeed misusing async JavaScript a bit, and that is what causes the script to fail. For me, on a slightly slow internet connection, the Evaluation failed: TypeError: Cannot read property 'getAttribute' of null error was always present. You can improve stability a bit by replacing the networkidle2 waitUntil setting with domcontentloaded in page.goto (make sure to read the docs on the difference between them).
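For example, the change is a one-liner (sketched here with the url helper from the question); domcontentloaded resolves as soon as the HTML is parsed, and the explicit waitForSelector then covers the one element you actually need:

// before: waits until there are no more than 2 network connections for 500 ms
await page.goto(url(keyword), { waitUntil: 'networkidle2' });

// after: resolves as soon as the DOM is parsed; waitForSelector
// then waits only for the listings we care about
await page.goto(url(keyword), { waitUntil: 'domcontentloaded' });
await page.waitForSelector('.m-gallery-product-item-v2');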

The main issue is that the async events (communication with the Chrome API) are not awaited. You could start refactoring the script with the following in mind:

Select elements more effectively

  1. I recommend using const to avoid accidentally overwriting elements you have already selected.
  2. Use the page context for identifying elements. Puppeteer (Chrome) also provides the $$ alias for querySelectorAll and $ for querySelector. (docs)
  3. Always await async events: everything that requires communication with the Chrome API is async!

before:

let items = document.querySelectorAll('.m-gallery-product-item-v2');

after:

const items = await page.$$('.m-gallery-product-item-v2');

Evaluate DOM content

Use element handles with page.evaluate to retrieve content (.getAttribute is only required in rare cases):

before:

let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");

after:

const title = await page.evaluate(el => el.title, (await page.$$('h4.organic-gallery-title__outter'))[i])
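Standard attributes like title and href are reflected as DOM properties, which is why el.title works above; getAttribute is still needed for custom data-* attributes. A short sketch (titleHandle and imgHandle are hypothetical element handles obtained via page.$$):

// reflected property: works for standard attributes such as title/href
const title = await page.evaluate(el => el.title, titleHandle);

// custom attribute: data-image has no direct property counterpart,
// so use getAttribute (or el.dataset.image)
const img = await page.evaluate(el => el.getAttribute('data-image'), imgHandle);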

Do not use forEach to iterate over async events

Luckily you haven't used async/await inside your forEach loop. But the lack of await was actually the reason your script failed whenever the page had not loaded in time. You do need await, just not inside forEach (and not inside Array.map either!). I suggest using for...of or a regular for loop instead if you want predictable behaviour with Puppeteer actions. (In the current example the array index plays a key part, so I used a for loop for simplicity.)
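A for...of sketch of the same idea (item.$ scopes the query to a single listing, and each await finishes before the next iteration starts):

for (const item of items) {
  // item.$ runs querySelector inside this listing only
  const titleHandle = await item.$('h4.organic-gallery-title__outter');
  if (titleHandle) {
    const title = await page.evaluate(el => el.title, titleHandle);
    console.log(title);
  }
}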

Note: It is possible to keep the callback style, but you will need to collect the promises (e.g. with .map) and wrap them in Promise.all.
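A minimal sketch of that pattern; note that the evaluations then run concurrently, so don't rely on side-effect ordering the way you can in a for loop:

// collect one promise per element handle, then await them all together
const titleHandles = await page.$$('h4.organic-gallery-title__outter');
const titles = await Promise.all(
  titleHandles.map(handle => page.evaluate(el => el.title, handle))
);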


Use try...catch on smaller fragments of the code

For example, inside the loop for each iteration, so your script won't crash when only one array element has an issue. It is very frustrating when a scraper runs for hours and then fails near the end.


Refactor the "urls" function

The page.evaluate part keeps the code somewhat async, but you can solve this by applying the suggestions above and awaiting each step. You no longer need to return the results object at the end; instead, you can populate it on each iteration of the loop.

Refactored example

It won't fail anymore, and the console.log(items); is now printed to the terminal.

const puppeteer = require('puppeteer');

/* first page search URL */
const url = keyword => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`;

/* keyword to search for */
const keyword = 'future';

const results = [];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  try {
    await page.goto(url(keyword), { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.m-gallery-product-item-v2');
    const items = await page.$$('.m-gallery-product-item-v2');

    // unlike in the original script, this console.log is now printed to the terminal
    console.log(items);

    for (let i = 0; i < items.length; i++) {
      try {
        let CurrentTime = Date.now();
        const title = await page.evaluate(el => el.title, (await page.$$('h4.organic-gallery-title__outter'))[i]);
        const link = await page.evaluate(el => el.href, (await page.$$('.organic-list-offer__img-section, .list-no-v2-left__img-container'))[i]);
        const img = await page.evaluate(el => el.getAttribute('data-image'), (await page.$$('.seb-img-switcher__imgs'))[i]);

        results.push({
          scrapeTime: CurrentTime,
          title: title,
          link: `https:${link}`,
          img: `https:${img}`
        });

      } catch (e) {
        console.error(e);
      }
    }

    console.log(results);
    await browser.close();
  } catch (e) {
    console.log(e);
    await browser.close();
  }
})();

Edit: The script can still fail from time to time because, on Alibaba's site, the .organic-list-offer__img-section CSS class has been changed to .list-no-v2-left__img-container. They are either A/B testing two layouts with different selectors, or really do change CSS classes this frequently.


Edit 2: In case an element can have multiple selectors per user session (possibly due to A/B testing), you can use both possible selectors separated by a comma, like:

const link = await page.evaluate(el => el.href, (await page.$$('.organic-list-offer__img-section, .list-no-v2-left__img-container'))[i]);

This ensures the element can be selected in either case; the comma acts like an OR operator.
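Finally, since the original goal was to save the output as JSON for a front-end visualisation, here is a minimal sketch using Node's built-in fs module (results.json is an arbitrary file name, and results is the array from the refactored example; run this after the loop finishes):

const fs = require('fs');

// pretty-print the scraped array to disk as JSON
fs.writeFileSync('results.json', JSON.stringify(results, null, 2));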

Upvotes: 3

Rustam D9RS

Reputation: 3491

You need to check that title, link and img exist before calling getAttribute. For example, on my side the link is not found with your selector, but it is found with this one:

let link = item.querySelector('.organic-gallery-title').getAttribute('href');

I do not know what causes this; perhaps it is because we are in different countries. In any case, you can try this selector and see how the program behaves with it. Hope this helps somehow.

You can perform an existence check as follows:

const puppeteer = require('puppeteer');
const fs = require('fs');

/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`

/* keyword to search for */
const keyword = `future`;

(async () => {
    // declared outside try so the catch block can still close the browser
    let browser;
    try {
        browser = await puppeteer.launch({
            headless: true
        });

        const page = await browser.newPage();

        await page.goto(url(keyword), { waitUntil: 'networkidle2' });

        await page.waitForSelector('.m-gallery-product-item-v2');

        const urls = await page.evaluate(() => {
            const results = [];
            const items = document.querySelectorAll('.m-gallery-product-item-v2');

            items.forEach(item => {
                const scrapeTime = Date.now();
                const titleElement = item.querySelector('h4.organic-gallery-title__outter');
                const linkElement = item.querySelector('.organic-list-offer__img-section');
                const imgElement = item.querySelector('.seb-img-switcher__imgs');
                
                /**
                 * You can combine croppedLink and link, or croppedImg and img to not make two variables if you want.
                 * But, in my opinion, separate variables are better. 
                 */
                const title = titleElement ? titleElement.getAttribute('title') : null;
                const croppedLink = linkElement ? linkElement.getAttribute('href') : null;
                const croppedImg = imgElement ? imgElement.getAttribute('data-image') : null;

                const link = croppedLink ? `https:${croppedLink}` : null;
                const img = croppedImg ? `https:${croppedImg}` : null;

                results.push({ scrapeTime, title, link, img });
            });

            return results;
        });

        console.log(urls);
        await browser.close();
    } catch (e) {
        console.log(e);
        if (browser) await browser.close();
    }
})();
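As a side note, if the bundled Chromium supports optional chaining (recent versions do), the existence checks inside page.evaluate can be written more compactly; a sketch of the title line under that assumption:

// evaluates to null instead of throwing when the selector matches nothing
const title = item.querySelector('h4.organic-gallery-title__outter')?.getAttribute('title') ?? null;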

Upvotes: 1
