Reputation:
I have read a little about Headless Chrome and the Puppeteer API that Google has developed. I have seen a few answers on Stack Overflow about running Headless Chrome, and I am familiar with Selenium for testing and scraping web pages. I have written an HTML parser, search and update package myself, but I often run into problems when the data I am trying to parse and retrieve is produced by JavaScript on the page.
According to Google's documentation, Headless Chrome is supported on the Google Cloud Platform Shell (a Linux/Debian-type UNIX command line, similar to Amazon Web Services). Today, I tried to download a web page with a simple Headless Chrome command line, but the shell returned an error. This is the command:
@cloudshell:~$ chrome --headless --disable-gpu --dump-dom https://sepehr.irib.ir/?idc=32&idt=tv&idv=1
I typed this in an instance of the BASH Shell on GCP, and received this error.
[1] 498
[2] 499
bash: chrome: command not found
[2]+ Done idt=tv
The URL above is just one from this Stack Overflow question; I was toying around to see if I could answer it. It is a very common type of question on the Web-Scraping tag. It is not too important to me, though it probably is to the OP.
According to a few YouTube videos, the Headless Chrome JSON API lets users start an instance of Chrome that functions like a PaaS rather than a UI that can be viewed. This seems pretty nice, and I am fully aware that Selenium already takes advantage of it. However, I would like to access the JSON API from Java without using Selenium, primarily to see if I can understand it and, hopefully, to begin making JSON requests (in Java) to a Headless Chrome running in a Google Cloud Shell instance without the added complexity of the Java Selenium package.
This Stack Overflow question (and its answers) seems to be a "partial duplicate" of mine. Unfortunately, the Google Help pages state that the service has been fully supported since 2019, and the answers there are from 2018. I suspect I should not have to perform a complete build of Chrome in order to run a headless Chrome instance from the command line, but I could be wrong. In any case, newer answers that reflect the work done by Google devs in 2019 and 2020 would help. More importantly, I would like to use "Plain Old Java Objects" to query the browser, rather than Puppeteer and Node.js. I can deal with JSON very well in Java.
Is there a BASH 'sudo' command that I may use to get an instance of Chrome running in the Shell of GCP?
I have reviewed the suggested duplicates of this question, and do not know what to do... :)
Upvotes: 1
Views: 4078
Reputation: 75735
First, you have to install headless Chrome on your Cloud Shell. Here is the script:
export CHROME_BIN=/usr/bin/google-chrome
export DISPLAY=:99.0
sh -e /etc/init.d/xvfb start
sudo apt-get update
sudo apt-get install -y libappindicator1 fonts-liberation libasound2 libgconf-2-4 libnspr4 libxss1 libnss3 xdg-utils
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
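Then run your command. Don't forget to surround your URL with double quotes ("). An unquoted & tells the shell to run whatever precedes it as a background job, so the URL gets cut off at the first & and the rest (idt=tv, idv=1) is treated as separate commands. That is where the [1] 498 and [2] 499 lines in your output came from.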
/usr/bin/google-chrome-stable --headless --disable-gpu --dump-dom "https://sepehr.irib.ir/?idc=32&idt=tv&idv=1"
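The effect of the quotes can be checked without Chrome at all. The following is only a minimal sketch: show_url is a hypothetical stand-in that prints the one argument it receives, so the output shows what Chrome would have been handed in each case.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for Chrome: prints the single URL argument it was given.
show_url() {
  printf '%s\n' "$1"
}

# Quoted: the whole URL, including both & characters, is passed as one argument.
quoted=$(show_url "https://sepehr.irib.ir/?idc=32&idt=tv&idv=1")

# Unquoted: the shell splits the line at each &. show_url runs in the background
# with only the part before the first &, and idt=tv / idv=1 become
# plain variable assignments. The wait lets the background job finish.
unquoted=$(show_url https://sepehr.irib.ir/?idc=32&idt=tv&idv=1; wait)

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

The second line of output stops at idc=32, which matches the "[2]+ Done idt=tv" message in the question.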
I got some errors during the install, which I fixed with this command:
sudo apt --fix-broken install
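To confirm that the install worked before running the dump command, a small check like the one below may help. It is a sketch, not part of the original steps: find_first_command is a hypothetical helper, and the two binary names are assumed to be the ones the .deb package puts on the PATH (both appear in the commands above).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of the first name found on PATH,
# or return 1 if none of the given names exist.
find_first_command() {
  for name in "$@"; do
    if command -v "$name" >/dev/null 2>&1; then
      command -v "$name"
      return 0
    fi
  done
  return 1
}

# Assumed binary names, taken from the commands used in this answer.
if chrome_bin=$(find_first_command google-chrome-stable google-chrome); then
  echo "Chrome found at: $chrome_bin"
  "$chrome_bin" --version
else
  echo "Chrome is not on PATH yet; re-run the install steps above."
fi
```

If the check fails, the original "bash: chrome: command not found" error would be expected again, since no binary named chrome is installed by this package.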
Upvotes: 2