Reputation: 85
I am very new to Puppeteer. I started yesterday, and I'm trying to make a program that steps through URLs whose player IDs increment one after the other and saves each player's stats using NeDB. There are thousands of links to go through, and I have found that if I use a for loop my computer basically crashes because 1,000 Chromium instances try to open all at the same time. Is there a better or proper way to do this? Any advice would be appreciated.
const puppeteer = require('puppeteer');
const Datastore = require('nedb');
const database = new Datastore('database.db');
database.loadDatabase();

async function scrapeProduct(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    let attributes = [];

    // Getting player's name
    const [name] = await page.$x('//*[@id="ctl00_ctl00_ctl00_Main_Main_name"]');
    const txt = await name.getProperty('innerText');
    const playerName = await txt.jsonValue();
    attributes.push(playerName);

    // Getting all 12 individual stats of the player
    for (let i = 1; i < 13; i++) {
        let vLink = '//*[@id="ctl00_ctl00_ctl00_Main_Main_SectionTabBox"]/div/div/div/div[1]/table/tbody/tr[' + i + ']/td[2]';
        const [e1] = await page.$x(vLink);
        const val = await e1.getProperty('innerText');
        const skillVal = await val.jsonValue();
        attributes.push(skillVal);
    }

    // Creating a player object to store the data how I want (I know this is probably ugly code and could be done in a much better way)
    let player = {
        Name: attributes[0],
        Athleticism: attributes[1],
        Speed: attributes[2],
        Durability: attributes[3],
        Work_Ethic: attributes[4],
        Stamina: attributes[5],
        Strength: attributes[6],
        Blocking: attributes[7],
        Tackling: attributes[8],
        Hands: attributes[9],
        Game_Instinct: attributes[10],
        Elusiveness: attributes[11],
        Technique: attributes[12],
    };

    database.insert(player);
    await browser.close();
}
// For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
for (let i = 0; i <= 1000; i++) {
    let link = 'https://url.com/?id=' + i + '&section=Ratings';
    scrapeProduct(link);
    console.log("Player #" + i + " scraped");
}
Upvotes: 0
Views: 380
Reputation: 3540
If you think the speed issue comes from reopening/closing the browser on every run, move browser to the global scope and initialize it to null. Then create an init function with something like:
async function init() {
    if (!browser)
        browser = await puppeteer.launch()
}
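The global declaration that init relies on would look something like this, placed at module scope above it:

// Shared browser handle, launched once by init() and reused across calls
let browser = null;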
Allow pages to be passed to your scrapeProduct function, so async function scrapeProduct(url) becomes async function scrapeProduct(url, page). Replace await browser.close() with await page.close(). Now your loop will look like this:
// For loop to loop through 1000 player links... Url.com is swapped in here because the actual url is ridiculously long and not important.
await init();
for (let i = 0; i <= 1000; i++) {
    let link = 'https://url.com/?id=' + i + '&section=Ratings';
    let page = await browser.newPage()
    scrapeProduct(link, page);
    console.log("Player #" + i + " scraped");
}
await browser.close()
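For reference, a minimal sketch of what scrapeProduct looks like after those two changes (the scraping logic itself is unchanged from the question):

// The shared global browser is used instead of launching a new one per call
async function scrapeProduct(url, page) {
    await page.goto(url);

    // ... same name/stats scraping and database.insert(player) as in the question ...

    // Close only this tab; the shared browser stays open
    await page.close();
}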
If you wanted to limit the number of pages the browser runs concurrently, you could create a function to do that:
async function getTotalPages() {
    const allPages = await browser.pages()
    return allPages.length
}

async function newPage() {
    const MAX_PAGES = 5
    await new Promise(resolve => {
        // check once a second how many pages are open
        const interval = setInterval(async () => {
            let totalPages = await getTotalPages()
            if (totalPages < MAX_PAGES) {
                clearInterval(interval)
                resolve()
            }
        }, 1000)
    })
    return await browser.newPage()
}
If you did this, in your loop you'd replace let page = await browser.newPage() with let page = await newPage().
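Putting it together, the concurrency-limited version of the loop would look something like this (a sketch based on the snippets above; MAX_PAGES caps how many tabs are open at once):

await init();
for (let i = 0; i <= 1000; i++) {
    let link = 'https://url.com/?id=' + i + '&section=Ratings';
    // newPage() waits until fewer than MAX_PAGES tabs are open before opening another
    let page = await newPage();
    scrapeProduct(link, page);
    console.log("Player #" + i + " scraped");
}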
Upvotes: 0
Reputation: 370689
The easiest tweak would be to wait for each link to finish before starting the next:
(async () => {
    for (let i = 0; i <= 1000; i++) {
        let link = 'https://url.com/?id=' + i + '&section=Ratings';
        await scrapeProduct(link);
        console.log("Player #" + i + " scraped");
    }
})();
You could also allow only as many to run at once as your computer can handle. This will require more resources, but will let the process finish faster. Figure out the limit you want, then do something like:
let i = 0;
const getNextLink = () => {
    if (i > 1000) return;
    let link = 'https://url.com/?id=' + i + '&section=Ratings';
    i++;
    return scrapeProduct(link)
        .then(getNextLink)
        .catch(handleErrors);
};

Promise.all(Array.from(
    { length: 4 }, // allow 4 to run concurrently
    getNextLink
))
    .then(() => {
        // all done
    });
The above allows up to 4 calls of scrapeProduct to be active at any one time; change the number as needed.
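Note that handleErrors is not defined in the snippet above; it's a placeholder for whatever error handling you prefer. A minimal version might log the failure and keep that worker going, for example:

// Hypothetical handler: log the failed scrape, then continue with the next link
const handleErrors = (err) => {
    console.error('Scrape failed:', err);
    return getNextLink();
};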
Upvotes: 1