Reputation: 155
I want to learn some web scraping and came across the Puppeteer library. I chose Puppeteer over other tools because I have some background in JS.
I also found this website, whose purpose is to be scraped. I've managed to get the info for every book on every page. Here's what I did:
(async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto(url); // http://books.toscrape.com/
  const json = [];
  let next = await page.$('.pager .next a'); // next button
  while (next) {
    // get all articles
    let articles = await page.$$('.product_pod a');
    // click on each, get data and go back
    for (let index = 0; index < articles.length; index++) {
      await Promise.all([
        page.waitForNavigation(),
        articles[index].click(),
      ]);
      const data = await page.evaluate(getData);
      json.push(data);
      await page.goBack();
      articles = await page.$$('.product_pod a');
    }
    // click the next button
    await Promise.all([
      page.waitForNavigation(),
      page.click('.pager .next a'),
    ]);
    // get the new next button
    next = await page.$('.pager .next a');
  }
  fs.writeFileSync(file, JSON.stringify(json), 'utf8');
  await browser.close();
})();
The function getData, passed to page.evaluate, returns an object with the desired properties:
function getData() {
  const product = document.querySelector('.product_page');
  return {
    title: product.querySelector('h1').textContent,
    price: product.querySelector('.price_color').textContent,
    description:
      document.querySelector('#product_description ~ p')
        ? document.querySelector('#product_description ~ p').textContent
        : '',
    category:
      document.querySelector('.breadcrumb li:nth-child(3) a')
        ? document.querySelector('.breadcrumb li:nth-child(3) a').textContent
        : '',
    cover:
      location.origin +
      document.querySelector('#product_gallery img')
        .getAttribute('src').slice(5),
  };
}
When I finally execute the script, everything goes well except that the final JSON file contains duplicated records. That is, every book has two entries in the file. I know that the script could be better, but what do you think is happening with this approach?
Upvotes: 0
Views: 495
Reputation: 84465
Your selector in this line:
let articles = await page.$$('.product_pod a');
is matching more than is required. You get 40 elements, not 20: the a tags inside the image container are also included, and they point to the same pages as the a inside each h3.
You want to restrict the selector to h3 a:
let articles = await page.$$('.product_pod h3 a');
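If you want to check this yourself, here is a minimal sketch that counts how many links a selector matches and how many distinct targets they have. The helper name uniqueHrefs is made up for illustration, and it assumes page is an open Puppeteer Page on a listing page; page.$$eval runs the callback in the browser against every element the selector matches.

```javascript
// Hypothetical helper (not part of Puppeteer): reports how many elements
// `selector` matches and which distinct link targets they point to.
// Assumes `page` is a Puppeteer Page already on a listing page.
async function uniqueHrefs(page, selector) {
  // $$eval evaluates the callback in the page with all matching elements.
  const hrefs = await page.$$eval(selector, (links) => links.map((a) => a.href));
  return {
    matched: hrefs.length,        // total elements matched
    unique: [...new Set(hrefs)],  // distinct link targets, in first-seen order
  };
}
```

Calling await uniqueHrefs(page, '.product_pod a') and then await uniqueHrefs(page, '.product_pod h3 a') should show the first selector matching twice as many elements for the same set of unique links, which is where the duplicated records come from.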
Upvotes: 2