user13129846

How do I run Headless Chrome in a Shell on Google Cloud Platform?

I have read a little about Headless Chrome and the Puppeteer API that Google has developed. I have seen a few answers on Stack Overflow about running Headless Chrome, and I am also familiar with Selenium for testing and scraping web pages. I have written my own HTML parser, search, and update package, but I often run into problems when the data I am trying to parse and retrieve is produced by JavaScript on the page.

According to Google's documentation, Headless Chrome is supported in Google Cloud Shell (a Debian-based Linux command line). Today I tried to download a web page with a simple Headless Chrome command line:

@cloudshell:~$ chrome --headless --disable-gpu --dump-dom https://sepehr.irib.ir/?idc=32&idt=tv&idv=1

I typed this in an instance of the BASH Shell on GCP, and received this error.

[1] 498
[2] 499
bash: chrome: command not found
[2]+  Done                    idt=tv

The URL above is just one from this Stack Overflow question; I was toying around to see whether I could answer it. It is a very common type of question on the web-scraping tag, and it is not too important to me (though it probably is to the OP).

According to a few YouTube videos, the Google Chrome Headless JSON API lets users start an instance of Chrome that functions like a PaaS rather than a UI that can be viewed. This seems pretty nice, and I am fully aware that Selenium already takes advantage of it. However, I would like to access the JSON API from Java without using Selenium, primarily to see if I can understand it and, hopefully, to begin making JSON requests (in Java) to a Headless Chrome running in a Google Cloud Shell instance, without the added complexity of the Java Selenium package.

This Stack Overflow question (and its answers) seems to be a partial duplicate of mine. Unfortunately, the Google help pages state that the service has been fully supported since 2019, and the answers there are from 2018. I suspect I should not have to perform a complete build of Chrome in order to run a headless Chrome instance from the command line, but I could be wrong. In any case, newer answers reflecting the work done by Google devs in 2019 and 2020 would help. More importantly, I would like to use "Plain Old Java Objects" to query the browser rather than Puppeteer and Node.js. I can deal with JSON very well in Java.

Is there a BASH 'sudo' command that I may use to get an instance of Chrome running in the Shell of GCP?

I have reviewed the suggested duplicates of this question, and do not know what to do... :)

Upvotes: 1

Views: 4078

Answers (1)

guillaume blaquiere

Reputation: 75735

First, you have to install Chrome on your Cloud Shell so that it can run headless. Here is the script:

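# Tell tools that look for Chrome where the binary will live
export CHROME_BIN=/usr/bin/google-chrome
# Optional: point at a virtual X display; --headless itself does not need one
export DISPLAY=:99.0
sh -e /etc/init.d/xvfb start
# Install the libraries Chrome depends on
sudo apt-get update
sudo apt-get install -y libappindicator1 fonts-liberation libasound2 libgconf-2-4 libnspr4 libxss1 libnss3 xdg-utils
# Download and install the current stable Chrome package
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb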
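
To confirm that the install put a Chrome binary on the PATH (the question's error was "chrome: command not found"), a check along these lines may help. This is only a minimal sketch: the helper name first_on_path is hypothetical, and the candidate names other than google-chrome-stable are assumptions about what a given image might provide.

```shell
# Hypothetical helper: print the path of the first command name found on PATH.
first_on_path() {
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}

# Candidate names are assumptions; the .deb above installs google-chrome-stable.
if chrome_bin=$(first_on_path google-chrome-stable google-chrome chromium); then
  echo "Chrome found at: $chrome_bin"
else
  echo "No Chrome binary on PATH yet; re-check the install steps." >&2
fi
```

Note that the package does not install a command named plain "chrome", which is why the command in the question could not be found.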

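Then run your command. Don't forget to surround the URL with double quotes ("), because an unquoted & tells the shell to run everything before it as a background job. That is why the output in the question shows job numbers and treats idt=tv as a separate command.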

/usr/bin/google-chrome-stable --headless --disable-gpu --dump-dom "https://sepehr.irib.ir/?idc=32&idt=tv&idv=1"
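
The effect of the quotes can be seen without Chrome at all. The sketch below is only an illustration: count_args is a hypothetical helper, and the unquoted form is run in a child shell so that its stray variable assignments do not leak into the current one.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

url='https://sepehr.irib.ir/?idc=32&idt=tv&idv=1'

# Quoted: the whole URL arrives as a single argument.
quoted_count=$(count_args "$url")

# Unquoted, as typed in the question: the shell splits the line at each '&',
# so printf only ever sees the part of the URL before the first '&'.
unquoted_seen=$(sh -c 'printf "%s\n" https://sepehr.irib.ir/?idc=32&idt=tv&idv=1; wait')

echo "quoted argument count: $quoted_count"
echo "unquoted command saw:  $unquoted_seen"
```

In the unquoted case the remaining pieces, idt=tv and idv=1, are parsed as separate variable assignments, so Chrome would never receive them.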

I got some errors, which I fixed with this command:

sudo apt --fix-broken install
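
For the JSON API mentioned in the question, Chrome can expose its DevTools HTTP endpoint when started with --remote-debugging-port. The following is a rough sketch rather than a tested setup for Cloud Shell: both helper names are hypothetical, port 9222 is just a conventional choice, and the two-second wait in the usage lines is a guess at the startup time.

```shell
# Hypothetical helper: build a DevTools HTTP endpoint URL for a local Chrome.
# $1 = port, $2 = endpoint name under /json (for example "version").
devtools_url() {
  printf 'http://localhost:%s/json/%s\n' "$1" "$2"
}

# Hypothetical helper: start headless Chrome in the background with the
# DevTools port open, and print its process ID. $1 = port.
start_headless_chrome() {
  /usr/bin/google-chrome-stable --headless --disable-gpu \
    --remote-debugging-port="$1" about:blank >/dev/null 2>&1 &
  echo $!
}

# Usage (run these once Chrome is installed):
#   pid=$(start_headless_chrome 9222)
#   sleep 2
#   curl -s "$(devtools_url 9222 version)"
#   kill "$pid"
```

The response to the version endpoint is a JSON document, so any HTTP client, including one written in plain Java, should be able to request the same URL and parse the result.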

Upvotes: 2
