Yordan

Reputation: 131

Scraping the same page forever using puppeteer

I'm doing scraping. How can I stay on a page and re-read its content to search for data every xx seconds, without refreshing the page? I use the approach below, but the PC crashes after some time. Any ideas on how to make it efficient? I would like to achieve it without using while (true). Note that the readOdds function does not always take the same amount of time.

//...
while (true) {
    const html = await page.content();
    cant = await readOdds(html); // some code with the html
    console.info('Waiting 5 seconds to read again...');
    await page.waitFor(5000);
}

This is a section of the readOdds function:

async function readOdds(htmlPage){
    try {
        var savedat = functions.mysqlDateTime(new Date());
        // strip line breaks so the regexes can match across the whole page
        var pageHtml = htmlPage.replace(/(\r\n|\n|\r)/gm, "");
        var exp_text_all = /<coupon-section(.*?)<\/coupon-section>/g;
        var leagueLinksMatches = pageHtml.match(exp_text_all);
        var cmarkets = 0;

        // reset the stored markets count before this read
        let reset = await mysqlfunctions.promise_updateMarketsCount(cmarkets, table_markets_count, site);
        console.log(reset);

        if (leagueLinksMatches == null) {
            return cmarkets;
        }
        for (let i = 0; i < leagueLinksMatches.length; i++) {
            const html = leagueLinksMatches[i];
            // championship name from the section title
            var expc = /class="title ellipsis-text">(.*?)<\/span/g;
            var nameChampionship = functions.getDataInHtmlCode(String(html).match(expc)[0]);

            var idChampionship = await mysqlfunctions.promise_db_insert_Championship(nameChampionship, gsport, table_championship);

            // one match per event line inside the section
            var exp_text = /<ui-event-line(.*?)<\/ui-event-line>/g;
            var text = html.match(exp_text);
            // console.info(text.length);

            for (let index = 0; index < text.length; index++) {
                const element = text[index];
....
  

Upvotes: 3

Views: 2712

Answers (1)

Md. Abu Taher

Reputation: 18826

Simple Solution with recursive callback

However, before we go into that: you can have the function call itself instead of using while, which loops forever without any proper control.

const readLoop = async () => {
  const html = await page.content();
  cant = await readOdds(html);
  return readLoop(); // run the loop again
};

// invoke it for infinite callbacks without any delays at all
await readLoop();

This will run the same block continuously, with no delay, for as long as your readOdds function keeps returning. You won't have to use page.waitFor and while.
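
If you still want the 5 second pause between reads, a minimal sketch (reusing page and readOdds from the question; the delay helper is my addition, not part of the original answer) is to await a timeout before recursing:

// resolves after ms milliseconds without blocking the event loop
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const readLoop = async () => {
  const html = await page.content();
  cant = await readOdds(html);
  console.info('Waiting 5 seconds to read again...');
  await delay(5000); // pause before the next read
  return readLoop(); // schedule the next iteration
};

await readLoop();

Because each recursive call only starts after the previous read has finished, reads never overlap, even when readOdds takes a varying amount of time.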

Memory leak prevention

For advanced cases where you need to respawn work over a period of time, a queue like bull and a process manager like PM2 come into play. However, a queue will void the "without refreshing the page" part of your question.
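
For illustration only, here is a minimal bull sketch (my assumption, not part of the original answer; it requires a running Redis instance, and the queue name and handler are hypothetical):

const Queue = require('bull');

// connects to redis://127.0.0.1:6379 by default
const oddsQueue = new Queue('read-odds');

// job handler; in a real setup each job would typically open the
// page fresh, which is why a queue voids the "without refreshing
// the page" requirement
oddsQueue.process(async () => {
  const html = await page.content();
  return readOdds(html);
});

// repeatable job: enqueue a read every 5 seconds
oddsQueue.add({}, { repeat: { every: 5000 } });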

You should definitely use pm2, though.

The usage is as follows:

npm i -g pm2
pm2 start index.js --name myawesomeapp # or your app file

There are a few useful arguments (see the combined example after the list):

  • --max-memory-restart 100M limits memory usage to 100M; pm2 restarts the app when it exceeds that.
  • --max-restarts 50 stops the app once it has restarted 50 times due to errors (or memory leaks).
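
For example, both flags combined with the start command from above (the app file and name are the same placeholders):

pm2 start index.js --name myawesomeapp --max-memory-restart 100M --max-restarts 50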

You can check the logs using pm2 logs myawesomeapp, since that is the name set above.

Upvotes: 3
