Reputation: 1
I'm facing the following problem with my Puppeteer crawler: the site I'm scraping has paginated results, and you navigate to the next page by clicking an arrow at the bottom of the page (there is no usable href on the link, so the click on the button has to be simulated). On each page, I need to scrape the details of all the items (real estate cards, 30 cards per page).
The question is: how to navigate to all following pages, and scrape all cards on each page?
What I've done: on the start URL, I fill in and submit a form, which returns the first 30 results for my request. Then I loop on the selector matching the arrow at the bottom of the page and click it until the selector is no longer present. The navigation works, but the scraper doesn't collect the card links on each page. Only the first 30 cards are scraped, and then the scraper stops.
async function pageFunction(context) {
    switch (context.request.userData.label) {
        case 'START': return handleStart(context);
        case 'DETAIL': return handleDetail(context);
    }

    async function handleStart({ log, page, customData }) {
        // fill in form and submit to get the results page
        await page.click(home.submitSearch);
        // wait for some selectors on the first results page
        await page.waitForSelector(searchResults.card);
        await page.waitForSelector(searchResults.blockNavigation);
        // navigate with pagination
        while (await page.$(searchResults.nextPage) !== null) {
            await page.waitForSelector(searchResults.card);
            await page.waitForSelector(searchResults.blockNavigation);
            await page.click(searchResults.nextPage);
        }
    }

    async function handleDetail({ request, log, skipLinks, page }) {
        const description = await page.$eval(descriptionSelector, (el) => el.textContent);
        return { description };
    }
}
The 'START' label matches the start URL with the form.
The 'DETAIL' label matches the links to the individual cards on the results pages.
Any idea on how to handle this case?
Upvotes: 0
Views: 3231
Reputation: 468
This is a typical problem in web scraping. It looks like the website uses XHR requests to load additional data after you click the next button.
It is hard to give concrete advice without knowing the structure of the website and how it works, but you can try one of these two approaches:
1) Use the website's XHR requests to get the data directly. You can use the browser's developer tools to inspect the XHR requests and replicate them in your crawler.
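For example, once you have found the underlying endpoint in the Network tab, you can call it directly and page through the results. Everything in this sketch (the endpoint URL, the `page` query parameter, and the `results` field in the response) is an assumption; substitute whatever the real request and response actually look like:

```javascript
// Hypothetical endpoint — find the real one in the browser's Network tab.
const ENDPOINT = 'https://example.com/api/search';

// Build the URL for a given results page (assumed "page" query parameter).
function buildPageUrl(baseUrl, page) {
    const url = new URL(baseUrl);
    url.searchParams.set('page', String(page));
    return url.toString();
}

// Fetch result pages until one comes back empty (requires Node 18+ for fetch).
async function fetchAllPages(maxPages) {
    const cards = [];
    for (let page = 1; page <= maxPages; page++) {
        const res = await fetch(buildPageUrl(ENDPOINT, page));
        const data = await res.json(); // response shape depends on the site
        if (!data.results || data.results.length === 0) break;
        cards.push(...data.results);
    }
    return cards;
}
```

This is usually much faster and more reliable than clicking through the UI, because each page is a single HTTP request and there is no rendering or timing involved.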
2) The approach you already tried: wait and click the next button in a loop, and collect the data from each page until there is no next button left. I don't see an obvious issue in your current code, but it depends on the pseudo-URLs and clickable selectors you configured in Puppeteer Scraper. If you set them correctly, it should work.
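With this approach, the usual pitfalls are clicking next without first enqueueing the current page's cards, and not waiting for the XHR-driven list to actually change after the click. A rough sketch of the loop, where the selectors and the `enqueueRequest` helper are assumptions based on your snippet and the Apify pageFunction context:

```javascript
// Placeholder selectors — replace with the real ones from your project.
const searchResults = {
    card: '.result-card',         // assumption
    cardLink: '.result-card a',   // assumption
    nextPage: '.pagination-next', // assumption
};

// Build the request object for a card link (label matches your handleDetail).
function toDetailRequest(url) {
    return { url, userData: { label: 'DETAIL' } };
}

// Enqueue every card on the current page, then click next and wait for
// the list to change, until there is no next button left.
async function paginate({ page, enqueueRequest }) {
    do {
        await page.waitForSelector(searchResults.card);
        const links = await page.$$eval(
            searchResults.cardLink,
            (els) => els.map((el) => el.href),
        );
        for (const url of links) {
            await enqueueRequest(toDetailRequest(url));
        }
        const next = await page.$(searchResults.nextPage);
        if (!next) break;
        // Remember a card from the old page so we can detect the swap.
        const firstCard = await page.$eval(searchResults.card, (el) => el.outerHTML);
        await next.click();
        // Wait until the result list actually changes (XHR-driven update).
        await page.waitForFunction(
            (sel, old) => {
                const el = document.querySelector(sel);
                return el && el.outerHTML !== old;
            },
            {},
            searchResults.card,
            firstCard,
        );
    } while (true);
}
```

The key difference from your loop is that the card links are enqueued on every iteration, and the click is followed by an explicit wait for new content instead of moving straight to the next `page.$` check.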
Anyway, there is an excellent tutorial on how to handle pagination in Puppeteer Scraper; you can check it out.
Upvotes: 1