Reputation: 43
I thought making a basic image scraper would be a fun project. The code below works in the browser console on the website, but I don't know how to get it to work from my app.js.
var anchors = document.getElementsByTagName('a');
var hrefs = [];
for (var i = 0; i < anchors.length; i++) {
  var src = anchors[i].href;
  if (src.endsWith(".jpeg")) {
    hrefs.push(anchors[i].href);
  }
}
console.log(hrefs);
I thought using Puppeteer was a good idea, but my knowledge is too limited to determine whether that's right or not. This is my Puppeteer code:
const puppeteer = require("puppeteer");
async function scrape(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  var anchors = await page.evaluate(() => document.getElementsByTagName('a'));
  var hrefs = [];
  for (var i = 0; i < anchors.length; i++) {
    var img = anchors[i].href;
    if (img.endsWith(".jpeg")) {
      hrefs.push(anchors[i].href);
    }
  }
  console.log({hrefs}, {img});
  browser.close();
}
I understand that the last part of the code is wrong, but I can't find a solid answer on what should be written instead.
Thank you for your time.
Upvotes: 2
Views: 296
Reputation: 13802
page.evaluate() can only transfer serializable values (roughly, the values JSON can handle). Because document.getElementsByTagName() returns a collection of DOM elements, which are not serializable (they contain methods and circular references), each element in the collection is replaced with an empty object. You need to either return a serializable value (for example, an array of texts or href attributes) or use something like page.$$(selector) and the ElementHandle API.
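To make the "empty object" behaviour concrete, here is a rough analogy that runs in plain Node without Puppeteer. It is only a sketch: FakeAnchor and roundTrip are hypothetical names, and a JSON round trip stands in for Puppeteer's real transfer mechanism, which is not literally JSON.stringify. The point is that an object whose data lives behind prototype getters and methods, like a DOM element, has nothing left to serialize, while a plain array of strings survives intact.

```javascript
// Hypothetical stand-in for a DOM anchor: like a real element, it exposes its
// data through a prototype getter and a method, not own enumerable properties.
class FakeAnchor {
  #url; // private field, invisible to JSON.stringify

  constructor(url) {
    this.#url = url;
  }

  get href() {
    return this.#url;
  }

  click() {}
}

// Rough analogy for crossing the page/Node boundary (an assumption made for
// illustration): only JSON-compatible data survives the trip.
function roundTrip(value) {
  return JSON.parse(JSON.stringify(value));
}

const anchors = [new FakeAnchor("https://example.com/a.jpeg")];

// Sending the elements themselves: each one collapses to an empty object.
const asElements = roundTrip(anchors);

// Extracting the strings first: the hrefs arrive unchanged.
const asStrings = roundTrip(anchors.map((a) => a.href));

console.log(asElements);
console.log(asStrings);
```

This is why reading .href on the returned values in the question's code yields undefined: by the time the collection reaches Node, the getter is gone.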
Web APIs are not defined outside of the function passed to .evaluate(), so you need to put all of the Web API code inside that function and return serializable data from it.
const puppeteer = require("puppeteer");
async function scrape(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Everything inside this callback runs in the page, where `document` exists.
  const data = await page.evaluate(() => {
    const anchors = document.getElementsByTagName('a');
    const hrefs = [];
    for (let i = 0; i < anchors.length; i++) {
      const img = anchors[i].href;
      if (img.endsWith(".jpeg")) {
        hrefs.push(img);
      }
    }
    // An array of strings is serializable, so it reaches Node intact.
    return hrefs;
  });
  console.log(data);
  await browser.close();
}
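For comparison, the other route mentioned above, page.$$(selector) with the ElementHandle API, might look roughly like the sketch below. The names scrapeWithHandles and hasImageExtension are hypothetical, and matching ".jpg" as well as ".jpeg", ignoring case and query strings, is an assumption that goes beyond the original check. The filtering is done on the Node side here, after each handle has been asked for its href.

```javascript
// Hypothetical helper: true when the URL's path ends with one of the extensions.
// Parsing with URL means a query string such as "?size=2" does not break the check.
function hasImageExtension(href, extensions = [".jpeg", ".jpg"]) {
  let pathname;
  try {
    pathname = new URL(href).pathname.toLowerCase();
  } catch {
    return false; // empty or otherwise unparsable href
  }
  return extensions.some((ext) => pathname.endsWith(ext));
}

async function scrapeWithHandles(url) {
  // Loaded lazily so the helper above can be used without Puppeteer installed.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // page.$$() resolves to an array of ElementHandle references to in-page elements.
    const handles = await page.$$("a");
    // Each handle.evaluate() call runs in the page and returns a plain string.
    const hrefs = await Promise.all(
      handles.map((handle) => handle.evaluate((a) => a.href))
    );
    return hrefs.filter((href) => hasImageExtension(href));
  } finally {
    // Close the browser even if navigation or evaluation throws.
    await browser.close();
  }
}
```

It could be called as scrapeWithHandles("https://example.com").then(console.log). For a simple list of links the single page.evaluate() call is probably the lighter option, since this version makes one round trip to the page per anchor; handles are more useful when you also need to click or otherwise interact with the elements.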
Upvotes: 3