Reputation: 7616
I am looking for a command line option to get a webpage, and execute the associated JavaScript code. In other words, call a headless browser via command line.
I can't use wget because it does not load and execute the associated JavaScript:
wget --load-cookies cookies.txt -O /dev/null https://example.com/update?run=1
Use case: we have web pages that read Elasticsearch indexes, do some data manipulation, and update Elasticsearch indexes. We'd like to run the update hourly via a cron job. We don't need to capture anything (no PNG capture, no HTML capture). We simply need to load the webpage and execute its JavaScript, ideally with something like run-headless https://example.com/update. The OS is CentOS 7.
I searched Stack Overflow and did not find an answer that meets my needs. Selenium and similar tools seem like overkill.
Upvotes: 1
Views: 2473
Reputation: 7616
After some research I found a solution that uses Puppeteer to drive headless Chrome. Ideally I wanted a single command like run-headless https://example.com/update, but the page requires a login, so the headless browser is driven by a Puppeteer script instead.
Installation steps for CentOS 7.6:
1. Install Chrome
# cd
# mkdir install
# cd install/
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# yum localinstall vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-1.1.97.0-1.el7.x86_64.rpm
# yum localinstall vulkan-1.1.97.0-1.el7.x86_64.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/liberation-fonts-1.07.2-16.el7.noarch.rpm
# yum localinstall liberation-fonts-1.07.2-16.el7.noarch.rpm
# vi /etc/yum.repos.d/google-chrome.repo
# cat /etc/yum.repos.d/google-chrome.repo
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub
# yum install google-chrome-stable
2. Install Node.js
# curl -sL https://rpm.nodesource.com/setup_14.x | sudo bash -
# yum install nodejs
3. Patch /etc/sysctl.conf
This was needed to run puppeteer without disabling the sandbox:
# echo "user.max_user_namespaces=15000" >> /etc/sysctl.conf
# reboot
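After the reboot it may be worth confirming that the setting took effect before blaming puppeteer for sandbox errors. The sketch below is one way to do that from Node. It assumes the kernel exposes the value at /proc/sys/user/max_user_namespaces, and the function names are illustrative only.

```javascript
const fs = require('fs');

// Interpret the contents of /proc/sys/user/max_user_namespaces.
// A positive integer means user namespaces can be created.
function userNamespacesEnabled(procText) {
  const limit = Number.parseInt(String(procText).trim(), 10);
  return Number.isInteger(limit) && limit > 0;
}

// Read the live kernel value; returns null when the file is missing or unreadable.
function readUserNamespaceLimit(path = '/proc/sys/user/max_user_namespaces') {
  try {
    return fs.readFileSync(path, 'utf8');
  } catch (err) {
    return null;
  }
}

const current = readUserNamespaceLimit();
if (current === null) {
  console.log('max_user_namespaces is not exposed on this kernel');
} else if (userNamespacesEnabled(current)) {
  console.log(`user namespaces enabled (limit ${current.trim()})`);
} else {
  console.log('user namespaces disabled; the Chrome sandbox will likely fail to start');
}
```

Running sysctl user.max_user_namespaces as root should report the same number.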
4. Create run-hourly.js puppeteer script
This node script has to run as a regular user, not root:
$ cd /path/to/script
$ npm install --save puppeteer
$ npm install --save pending-xhr-puppeteer
$ mkdir userDataDir
$ vi run-hourly.js # (content below)
$ node run-hourly.js
File content of the run-hourly.js script:
const config = {
  userDataDir: __dirname + '/userDataDir',
  login: {
    url: 'https://www.example.com/login/',
    username: 'foobar',
    password: 'secret',
  },
  pages: [{
    url: 'https://www.example.com/update/hourly',
    pdfFile: __dirname + '/page.pdf'
  }]
};

const puppeteer = require('puppeteer');
const { PendingXHR } = require('pending-xhr-puppeteer');

(async () => {
  // initialize headless browser
  const browser = await puppeteer.launch({
    headless: true,                 // run headless
    dumpio: true,                   // pipe the browser's stdout/stderr to this process
    userDataDir: config.userDataDir // custom user data
  });
  try {
    const page = await browser.newPage();
    const pendingXHR = new PendingXHR(page);

    // login: start waiting for the navigation before clicking, to avoid a race
    await page.goto(config.login.url, {waitUntil: 'load'});
    await page.type('#loginusername', config.login.username);
    await page.type('#password', config.login.password);
    await Promise.all([
      page.waitForNavigation(),
      page.click('#signin'),
    ]);

    // load pages of interest one at a time, since they all share a single tab
    // (request interception is not enabled: PendingXHR listens to page events,
    // and interception without a handler would stall later requests)
    for (const pageCfg of config.pages) {
      await page.goto(pageCfg.url, {waitUntil: 'networkidle0'}); // wait for page load
      await pendingXHR.waitForAllXhrFinished();                  // wait for all XHR requests to finish
      await page.pdf({path: pageCfg.pdfFile});                   // generate PDF from rendered page
    }
  } finally {
    await browser.close(); // close Chrome even if a step fails
  }
})().catch((err) => {
  console.error(err);
  process.exitCode = 1; // report failure to cron via the exit status
});
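The script above keeps the username and password as literals in the config object. If that is a concern, one option is to read them from the environment instead. The sketch below assumes two variable names, UPDATE_USER and UPDATE_PASS, which are invented for this example, as is the helper name loginFromEnv.

```javascript
// Build the login section from environment variables instead of literals.
// UPDATE_USER and UPDATE_PASS are hypothetical names chosen for this sketch.
function loginFromEnv(env = process.env) {
  const missing = ['UPDATE_USER', 'UPDATE_PASS'].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`missing environment variable(s): ${missing.join(', ')}`);
  }
  return {
    url: 'https://www.example.com/login/', // same login URL as in run-hourly.js
    username: env.UPDATE_USER,
    password: env.UPDATE_PASS,
  };
}
```

In run-hourly.js the login entry of config would then become login: loginFromEnv(), and the two variables would have to be set for the cron job as well, since cron does not inherit an interactive shell's environment.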
5. Add hourly job to cron
Install the cron job as the same user that owns the script:
$ crontab -l
$ crontab -e
25 * * * * cd /path/to/script && node run-hourly.js > hourly.log 2>&1
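The entry above starts the job at minute 25 of every hour. If a run could ever take longer than an hour, two runs would overlap and share the same userDataDir. A simple lock file is one way to guard against that. The sketch below is an assumption-laden illustration: the lock location in the temp directory and the names acquireLock, releaseLock, and withLock are all made up, and a lock left behind by a hard crash would have to be removed by hand.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create the lock file exclusively; the 'wx' flag fails with EEXIST if it already exists.
function acquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx');
    fs.writeSync(fd, String(process.pid)); // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false; // another run holds the lock
    throw err;
  }
}

function releaseLock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Run `work` only if the lock is free; resolves to false when the run was skipped.
async function withLock(lockPath, work) {
  if (!acquireLock(lockPath)) return false;
  try {
    await work();
  } finally {
    releaseLock(lockPath); // release even if the work throws
  }
  return true;
}

// Example use: the real puppeteer work would replace the placeholder callback.
withLock(path.join(os.tmpdir(), 'run-hourly.lock'), async () => {
  console.log('lock acquired; the hourly update would run here');
}).then((ran) => {
  if (!ran) console.log('previous run still active; skipping');
}).catch((err) => {
  console.error(err);
  process.exitCode = 1;
});
```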
Upvotes: 2