I'm trying to build a web crawler with Node.js and came across the Puppeteer package, which looks perfect for what I want. My goal is to gather all the links from a page, all of its text content, and a screenshot of the page itself.
I ran the following and it appears to gather a large number of links. However, when I inspect the site itself, there are links that it is not gathering.
const puppeteer = require('puppeteer');
module.exports = () => {
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://pixabay.com/en/columbine-columbines-aquilegia-3379045/');
    await page.screenshot({ path: 'myscreenshot.png', fullPage: true });
    let text = await page.$eval('*', el => el.innerText.split(' '));
    text = text.map(string => {
      return string.replace(/[^\w\s]/gi, '');
    });
    let hrefs = await page.evaluate(() => {
      const links = Array.from(document.querySelectorAll('a'));
      return links.map(link => link.href);
    });
    console.log('done');
    await browser.close();
  })();
};
For example, this link: /go/?t=image-details-shutterstock&id=699165328
is nowhere in the array of hrefs. Worse, these are the links that lead out of the site, which are exactly the ones I want to follow; otherwise I'm stuck crawling only the one site.
Is there a reason my script is only returning some of the links? Is the querySelectorAll selector too narrow, or is it rejecting certain links?
Upvotes: 0
Views: 3302
Reputation: 19154
Those links are generated by an onclick event handler, so they have no href. The target is stored in the data-go attribute, for example:
<a data-go="image-details-shutterstock&id=458320033">
You only need to prepend /go/?t= to that value to get the full link. To pick up the attribute value alongside the regular hrefs:
return links.map(link => link.href || link.getAttribute('data-go'));
Note that there are also empty links with neither attribute, used for the menu, for which the expression above returns null. For example:
<a><i class="icon icon_menu_user"></i></a>
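Putting those pieces together, here is a minimal sketch of how the collected attributes could be turned into absolute URLs. The helper name resolveLinks and the { href, dataGo } record shape are assumptions made for illustration, not part of Puppeteer; the sample records stand in for what page.evaluate would return.

```javascript
// Hypothetical helper (not a Puppeteer API): converts raw anchor attributes
// into absolute URLs. Each record is assumed to be { href, dataGo }, where
// both values were read with getAttribute() inside the page.
function resolveLinks(records, baseUrl) {
  const urls = [];
  for (const { href, dataGo } of records) {
    let target = null;
    if (href) {
      target = href;                 // ordinary link
    } else if (dataGo) {
      target = '/go/?t=' + dataGo;   // onclick link: prepend the /go/?t= prefix
    }
    if (target === null) continue;   // empty menu anchors have nothing to follow
    try {
      // Resolve relative paths against the URL of the crawled page.
      urls.push(new URL(target, baseUrl).href);
    } catch (err) {
      // Skip values that cannot be parsed as a URL.
    }
  }
  return urls;
}

// Sample records standing in for the result of page.evaluate().
const sample = [
  { href: '/en/photos/', dataGo: null },
  { href: null, dataGo: 'image-details-shutterstock&id=458320033' },
  { href: null, dataGo: null },
];
const pageUrl = 'https://pixabay.com/en/columbine-columbines-aquilegia-3379045/';
console.log(resolveLinks(sample, pageUrl));
```

In the crawler itself, the records could be gathered inside page.evaluate by mapping each anchor to { href: a.getAttribute('href'), dataGo: a.getAttribute('data-go') }, then passed to the helper together with page.url().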
Upvotes: 1