Reputation: 23
I created a web scraping app, which checks for a certain problem on an ecommerce website.
What it does:
I wrapped that function in a cronjob function. On my local machine it runs fine.
Deployed like this:
It didn't work.
It worked. But only ran once.
Since I want to run that function several times per day, I need to fix the issue.
I have another app running which uses the same cronjob and notification function and it works on heroku.
Here's my code, if anyone is interested.
const puppeteer = require('puppeteer');
const nodemailer = require("nodemailer");
const CronJob = require('cron').CronJob;
let articleInfo ='';
const mailArr = [];
let body = '';
const testArr = [
'https://bxxxx..', 'https://b.xxx..', 'https://b.xxxx..',
];
async function sendNotification() {
let transporter = nodemailer.createTransport({
host: 'mail.brxxxxx.dxx',
port: 587,
secure: false,
auth: {
user: '[email protected]',
pass: process.env.heyBfPW2
}
});
let textToSend = 'This is the heading';
let htmlText = body;
let info = await transporter.sendMail({
from: '"BB Checker" <hey@baxxxxx>',
to: "[email protected]",
subject: 'Hi there',
text: textToSend,
html: htmlText
});
console.log("Message sent: %s", info.messageId);
}
async function boxLookUp (item) {
const browser = await puppeteer.launch({
headless: true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
],
});
const page = await browser.newPage();
await page.goto(item);
const content = await page.$eval('.set-article-info', div => div.textContent);
const title = await page.$eval('.product--title', div => div.textContent);
const orderNumber = await page.$eval('.entry--content', div => div.textContent);
// Check if deliveryTime is already updated
try {
await page.waitForSelector('.delivery--text-more-is-coming');
// if not
} catch (e) {
if (e instanceof puppeteer.errors.TimeoutError) {
// if not updated check if all parts of set are available
if (content != '3 von 3 Artikeln ausgewählt' && content != '4 von 4 Artikeln ausgewählt' && content != '5 von 5 Artikeln ausgewählt'){
articleInfo = `${title} ${orderNumber} ${item}`;
mailArr.push(articleInfo)
}
}
}
await browser.close();
};
const checkBoxes = async (arr) => {
for (const i of arr) {
await boxLookUp(i);
}
console.log(mailArr)
body = mailArr.toString();
sendNotification();
}
async function startCron() {
let job = new CronJob('0 */10 8-23 * * *', function() { // run every 10 minutes between 08:00 and 23:59
checkBoxes(testArr);
}, null, true, null, null, true);
job.start();
}
startCron();
Upvotes: 2
Views: 921
Reputation: 57135
Assuming the rest of the code works (nodemailer, etc), I'll simplify the problem to focus purely on running a scheduled Node Puppeteer task in Heroku. You can re-add your mailing logic once you have a simple example running.
Heroku runs scheduled tasks using simple job scheduling or a custom clock process.
Simple job scheduling doesn't give you much control, but is easier to set up and potentially less expensive in terms of billable hours if you're running it infrequently. The custom clock, on the other hand, will be a continuously-running process and therefore chew up hours.
A custom clock process can do your cron task exactly, so that's probably the natural fit for this case.
In some scenarios, you can work around the simple scheduler's limits and get more complicated schedules by having the task exit early or by deploying multiple apps.
For example, if you want a twice-daily schedule, you could have two apps that run the same task scheduled at different hours of the day. Or, if you wanted to run a task twice weekly, schedule it to run daily using the simple scheduler, then have it check its own time and exit immediately if the current day isn't one of the two desired days.
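The "check its own time and exit" workaround might look like the sketch below. It assumes the scheduler starts the script once a day and that the check is done in UTC; shouldRunToday, main and the ALLOWED_DAYS values are hypothetical names chosen for illustration, not part of any Heroku API.

```javascript
// Hypothetical guard for a job that the simple scheduler starts daily
// but that should only do work on certain weekdays.
// Days follow Date#getUTCDay numbering: 0 = Sunday ... 6 = Saturday.
const ALLOWED_DAYS = [1, 4]; // assumption: Monday and Thursday

const shouldRunToday = (date, allowedDays = ALLOWED_DAYS) =>
  allowedDays.includes(date.getUTCDay());

const main = (now = new Date()) => {
  if (!shouldRunToday(now)) {
    console.log("not a scheduled day, exiting early");
    return false;
  }
  console.log("running task");
  // ... launch Puppeteer and scrape here ...
  return true;
};

main();
```

The same pattern works for an hour window: compare date.getUTCHours() against the allowed range instead of the weekday.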
Regardless of whether you use a custom clock or simple scheduled task, note that long-running tasks really should be handled by a background task, so the examples below aren't production-ready. That's left as an exercise for the reader and isn't Puppeteer-specific.
Custom clock process
package.json:
{
"name": "test-puppeteer",
"version": "1.0.0",
"description": "",
"scripts": {
"start": "echo 'running'"
},
"author": "",
"license": "ISC",
"dependencies": {
"cron": "^1.8.2",
"puppeteer": "^9.1.1"
}
}
Procfile
clock: node clock.js
clock.js:
const {CronJob} = require("cron");
const puppeteer = require("puppeteer");
// FIXME move to a worker task; see https://devcenter.heroku.com/articles/node-redis-workers
const scrape = async () => {
const browser = await puppeteer.launch({
args: ["--no-sandbox", "--disable-setuid-sandbox"]
});
const [page] = await browser.pages();
await page.setContent(`<p>clock running at ${Date()}</p>`);
console.log(await page.content());
await browser.close();
};
new CronJob({
cronTime: "*/30 * * * * *", // run every 30 seconds for demonstration purposes
onTick: scrape,
start: true,
});
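Because a scrape can take longer than the interval between ticks, one option is to skip a tick while the previous run is still in progress. The following is only a sketch of that idea, not something the cron package provides: skipIfRunning is a hypothetical helper, and slowTask stands in for the scrape function so the snippet runs without Chromium.

```javascript
// Hypothetical wrapper: skips a tick if the previous run has not finished.
const skipIfRunning = (task) => {
  let running = false;
  return async (...args) => {
    if (running) {
      console.log("previous run still in progress, skipping tick");
      return false;
    }
    running = true;
    try {
      await task(...args);
      return true;
    } finally {
      // Reset the flag even if the task throws, so later ticks can run.
      running = false;
    }
  };
};

// Stand-in for the real scrape function.
const slowTask = () => new Promise((resolve) => setTimeout(resolve, 50));

const guarded = skipIfRunning(slowTask);
guarded(); // starts the task
guarded(); // skipped, because the first call is still running
```

In clock.js this would be wired up as onTick: skipIfRunning(scrape).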
Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):
heroku create
heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
(replace cryptic-dawn-48835 with your app name)
Deploy:
git init
git add .
git commit -m "initial commit"
heroku git:remote -a cryptic-dawn-48835
git push heroku master
Add a clock process:
heroku ps:scale clock=1
Verify that it's running with heroku logs --tail. Running heroku ps:scale clock=0 turns off the clock.
Simple scheduler
package.json: Same as above, but no need for cron. No need for a Procfile either.
task.js:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({
args: ["--no-sandbox", "--disable-setuid-sandbox"]
});
const [page] = await browser.pages();
await page.setContent(`<p>scheduled job running at ${Date()}</p>`);
console.log(await page.content());
await browser.close();
})();
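For a one-off scheduled script it can help to make sure the browser is closed even when the scrape throws, and to set a non-zero exit code on failure. Here is a minimal sketch of that pattern; withBrowser is a hypothetical helper, and fakeLaunch is a stand-in for puppeteer.launch so the snippet runs without Chromium.

```javascript
// Hypothetical helper: always closes the browser, even if `work` throws.
const withBrowser = async (launch, work) => {
  const browser = await launch();
  try {
    return await work(browser);
  } finally {
    await browser.close();
  }
};

// Stand-in for puppeteer.launch (assumption: only close() is needed here).
const fakeLaunch = async () => ({
  closed: false,
  async close() {
    this.closed = true;
  },
});

withBrowser(fakeLaunch, async () => "done")
  .then((result) => console.log(result))
  .catch((err) => {
    console.error(err);
    process.exitCode = 1; // signal failure to whatever started the process
  });
```

In task.js, the launch argument would be something like () => puppeteer.launch({args: ["--no-sandbox", "--disable-setuid-sandbox"]}).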
Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):
heroku create
heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
(replace cryptic-dawn-48835 with your app name)
Deploy:
git init
git add .
git commit -m "initial commit"
heroku git:remote -a cryptic-dawn-48835
git push heroku master
Add a scheduler:
heroku addons:create scheduler:standard -a cryptic-dawn-48835
Configure the scheduler by running:
heroku addons:open scheduler -a cryptic-dawn-48835
This opens a browser, where you can add the command node task.js to run every 10 minutes.
Verify that it worked after 10 minutes with heroku logs --tail. The online scheduler will show the time of next/previous execution.
See this answer for creating an Express-based web app on Heroku with Puppeteer.
Upvotes: 1
Reputation: 171
I had the same issue for 3 days. Here's something that might help: https://stackoverflow.com/a/55861535/13735374
Has to be done alongside the Procfile thing.
Upvotes: 1