Reputation: 75
As a personal challenge, I am trying to create a tool that scrapes the search results of a website (the shopping platform Alibaba is used for this experiment) using Puppeteer, and saves the output into a JSON object that can later be used to create a visualisation on the front end.
My first step has been to access the first page of the search results, and scrape the listings from there into an array:
const puppeteer = require('puppeteer');
const fs = require('fs');
/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`
/* keyword to search for */
const keyword = `future`;
(async () => {
  try {
    const browser = await puppeteer.launch({
      headless: true
    });
    const page = await browser.newPage();
    await page.goto(url(keyword), {
      waitUntil: 'networkidle2'
    });
    await page.waitForSelector('.m-gallery-product-item-v2');
    let urls = await page.evaluate(() => {
      let results = [];
      let items = document.querySelectorAll('.m-gallery-product-item-v2');
      // This console.log never gets printed to either the browser window or the terminal?
      console.log(items)
      items.forEach(item => {
        let CurrentTime = Date.now();
        let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");
        let link = item.querySelector('.organic-list-offer__img-section').getAttribute("href");
        let img = item.querySelector('.seb-img-switcher__imgs').getAttribute("data-image");
        results.push({
          'scrapeTime': CurrentTime,
          'title': title,
          'link': `https:${link}`,
          'img': `https:${img}`,
        })
      });
      return results;
    })
    console.log(urls)
    browser.close();
  } catch (e) {
    console.log(e);
    browser.close();
  }
})();
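The fs import is for the step I haven't wired in yet: persisting the scraped array to disk so the front end can load it. A minimal sketch of what I have in mind (results.json is just a placeholder filename), to run after the scrape succeeds:

// Planned follow-up step, not yet part of the script above:
// serialize the scraped array and write it to disk as JSON.
fs.writeFileSync('results.json', JSON.stringify(urls, null, 2));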
When I run the file (test-2.js) in the terminal using Node, it sometimes returns the results array just fine, and at other times throws an error. The error thrown roughly half of the time is:
Error: Evaluation failed: TypeError: Cannot read property 'getAttribute' of null
at __puppeteer_evaluation_script__:11:82
at NodeList.forEach (<anonymous>)
at __puppeteer_evaluation_script__:8:19
at ExecutionContext._evaluateInternal (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:102:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async ExecutionContext.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:33:16)
at async /Users/dmnk/scraper/test-2.js:24:20
-- ASYNC --
at ExecutionContext.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
at DOMWorld.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/DOMWorld.js:89:24)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
-- ASYNC --
at Frame.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
at Page.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/Page.js:612:14)
at Page.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:95:27)
at /Users/dmnk/scraper/test-2.js:24:31
at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: ReferenceError: browser is not defined
at /Users/dmnk/scraper/test-2.js:52:9
at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:53159) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
I am relatively new to asynchronous JavaScript.
I have been trying for days to understand why this error happens, to no avail. I would be extremely thankful for any help in understanding the cause and troubleshooting it.
Upvotes: 2
Views: 2616
Reputation: 8841
You do misuse async JavaScript a bit, and that is what causes the script to fail. For me, on a slightly slow internet connection, the Evaluation failed: TypeError: Cannot read property 'getAttribute' of null error was always present. You can improve stability a bit by changing the waitUntil setting in page.goto from networkidle2 to domcontentloaded (make sure to read the docs on the difference between them).
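That single change looks like this (the same line appears in the full script below):

await page.goto(url(keyword), { waitUntil: 'domcontentloaded' });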
The main issue is that the async events (communication with the Chrome API) are not await-ed. You could start to refactor the script keeping the following in mind:

- Use const to avoid accidental overwriting of your already selected elements.

- Use the page context for identifying elements. Puppeteer (Chrome) also provides you the $$ alias for querySelectorAll, and $ for querySelector. (docs)

- await the async events; everything that requires communication with the Chrome API is considered async!

before:

let items = document.querySelectorAll('.m-gallery-product-item-v2');

after:

const items = await page.$$('.m-gallery-product-item-v2');
- Use elementHandles with page.evaluate to retrieve content (.getAttribute is required only in very rare cases):

before:

let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");

after:

const title = await page.evaluate(el => el.title, (await page.$$('h4.organic-gallery-title__outter'))[i])
- Don't use forEach to iterate over async events. Luckily you haven't used async/await inside your forEach loop. But actually, the lack of async was the reason your script failed when the page was not loaded in time. You do need async, just not inside forEach (no, and not inside Array.map either!). I would rather suggest a for...of or a regular for loop if you want predictable behaviour with Puppeteer actions. (In the current example the array index plays a key part, so I used a for loop for simplicity.)
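For instance, a for...of version of the per-item work could look like this (just a sketch; elementHandle.$eval is Puppeteer's element-scoped querySelector-plus-evaluate shorthand):

// Iterate over the element handles sequentially, awaiting each step.
for (const item of await page.$$('.m-gallery-product-item-v2')) {
  const title = await item.$eval('h4.organic-gallery-title__outter', el => el.title);
  console.log(title);
}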
Note: It is possible to use forEach, but you will need to wrap it in Promise.all.
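A minimal sketch of that Promise.all wrapping (assuming the same page and selectors as in the script below):

const items = await page.$$('.m-gallery-product-item-v2');
// map() produces an array of promises; Promise.all awaits them all together.
const titles = await Promise.all(items.map(async (item) => {
  // $eval runs the callback on the first match inside this element handle
  return item.$eval('h4.organic-gallery-title__outter', el => el.title);
}));
console.log(titles);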
- Use try..catch, e.g. inside the loop for each iteration, so your script won't crash if only one array element has an issue. It can be very frustrating if you are running a scraper for hours and it fails near the end.
The page.evaluate part keeps the code kind of async, but you could solve this too by using the suggestions above and awaiting each step. You would then no longer return the results object at the end; instead, you populate it with each iteration of the loop.
It won't fail anymore, and the console.log(items); will also be logged to the console.
const puppeteer = require('puppeteer');
/* first page search URL */
const url = keyword => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`;
/* keyword to search for */
const keyword = 'future';
const results = [];
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    await page.goto(url(keyword), { waitUntil: 'domcontentloaded' });
    await page.waitForSelector('.m-gallery-product-item-v2');
    const items = await page.$$('.m-gallery-product-item-v2');
    // this console.log never gets printed to either the browser window or the terminal?
    console.log(items);
    for (let i = 0; i < items.length; i++) {
      try {
        let CurrentTime = Date.now();
        const title = await page.evaluate(el => el.title, (await page.$$('h4.organic-gallery-title__outter'))[i]);
        const link = await page.evaluate(el => el.href, (await page.$$('.organic-list-offer__img-section, .list-no-v2-left__img-container'))[i]);
        const img = await page.evaluate(el => el.getAttribute('data-image'), (await page.$$('.seb-img-switcher__imgs'))[i]);
        results.push({
          scrapeTime: CurrentTime,
          title: title,
          link: `https:${link}`,
          img: `https:${img}`
        });
      } catch (e) {
        console.error(e);
      }
    }
    console.log(results);
    await browser.close();
  } catch (e) {
    console.log(e);
    await browser.close();
  }
})();
Edit: The script can still fail from time to time because on Alibaba's site the .organic-list-offer__img-section CSS class has been changed to .list-no-v2-left__img-container. They are either A/B testing two layouts with different selectors, or really are changing CSS classes this frequently.
Edit 2: In case an element can have multiple selectors per user session (possibly due to product A/B testing), one can use both possible selectors separated by a comma, like:

const link = await page.evaluate(el => el.href, (await page.$$('.organic-list-offer__img-section, .list-no-v2-left__img-container'))[i]);

This ensures the element can be selected in both cases; the comma acts like an OR operator.
Upvotes: 3
Reputation: 3491
You need to check the existence of title, link and img before using getAttribute, since, for example, for me the link is not found with your selector, but it is found with this one:

let link = item.querySelector('.organic-gallery-title').getAttribute('href');

I do not know what causes this; maybe we are served different layouts in different countries. In any case, you can try this selector and see how the program behaves with it. Hope this helps somehow.
You can perform an existence check as follows:
const puppeteer = require('puppeteer');
const fs = require('fs');
/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`
/* keyword to search for */
const keyword = `future`;
(async () => {
  // Launch outside the try block so the catch block can still reference browser.
  const browser = await puppeteer.launch({
    headless: true
  });
  try {
    const page = await browser.newPage();
    await page.goto(url(keyword), { waitUntil: 'networkidle2' });
    await page.waitForSelector('.m-gallery-product-item-v2');
    const urls = await page.evaluate(() => {
      const results = [];
      const items = document.querySelectorAll('.m-gallery-product-item-v2');
      items.forEach(item => {
        const scrapeTime = Date.now();
        const titleElement = item.querySelector('h4.organic-gallery-title__outter');
        const linkElement = item.querySelector('.organic-list-offer__img-section');
        const imgElement = item.querySelector('.seb-img-switcher__imgs');
        /**
         * You can combine croppedLink and link, or croppedImg and img to not make two variables if you want.
         * But, in my opinion, separate variables are better.
         */
        const title = titleElement ? titleElement.getAttribute('title') : null;
        const croppedLink = linkElement ? linkElement.getAttribute('href') : null;
        const croppedImg = imgElement ? imgElement.getAttribute('data-image') : null;
        const link = croppedLink ? `https:${croppedLink}` : null;
        const img = croppedImg ? `https:${croppedImg}` : null;
        results.push({ scrapeTime, title, link, img });
      });
      return results;
    });
    console.log(urls);
    browser.close();
  } catch (e) {
    console.log(e);
    browser.close();
  }
})();
Upvotes: 1