Reputation: 69

Get HTML page with javascript elements in bash script

I'm trying to fetch a web page with some traffic statistics from a bash script. The actual values on this HTML page are written by JavaScript, so I can see them when viewing the page in a browser. But when I look at the HTML source, e.g. with curl, the place where the value should be looks like this:

<div>
    <div class="agile_float">
        <script type="text/javascript">
            dw(IDS_statistics_aglie_month_volume_used);
            dw(common_colon);
        </script>
    </div>
    <div id="month_used_value" class="agile_td_ltr"></div>
</div>

The value of interest would be in the now-empty div with the id "month_used_value". I have already found some hints to use PhantomJS, but I'm not sure whether that is really the way to go. Is there a simple way to gather these values from a bash script?

Upvotes: 4

Answers (3)

Orikson

Reputation: 69

Thank you for your response. Unfortunately, nothing worked for me. I'm on a Raspberry Pi without a GUI and tried it with Chromium and Firefox. It seems that Firefox does not even have something like a DOM-dump function, and Chromium keeps crashing, or at least does nothing helpful, for example:

$ chromium-browser --headless --dump-dom --disable-gpu --print-to-pdf "http://192.168.8.1/html/statistic.html"
[0110/214637.782914:ERROR:browser_main_loop.cc(596)] Failed to put Xlib into threaded mode.
[0110/214640.321288:FATAL:gpu_data_manager_impl_private.cc(897)] The display compositor is frequently crashing. Goodbye.
Trace/breakpoint trap

$ DISPLAY=:0 chromium-browser --headless --dump-dom --disable-gpu "http://192.168.8.1/html/statistic.html"
X Error:  BadDrawable
  Request Major code 55 ()
  ResourceID 0x0
  Error Serial #144
  Current Serial #146
X Error:  BadDrawable
  Request Major code 55 ()
  ResourceID 0x0
  Error Serial #144
  Current Serial #146
X Error:  BadDrawable
  Request Major code 55 ()
  ResourceID 0x0
  Error Serial #144
  Current Serial #146
[0110/214716.189128:FATAL:gpu_data_manager_impl_private.cc(897)] The display compositor is frequently crashing. Goodbye.
[0110/214716.198871:ERROR:broker_posix.cc(40)] Recvmsg error: Connection reset by peer (104)
Trace/breakpoint trap

I also read about browsh and tried that. But it is also extremely unstable: it keeps crashing during startup and loaded my page only once. As I was not able to find any way to output the page in a machine-readable format, I looked around a bit more.

And I actually found something that looks really nice for me. I'm trying to read out some traffic statistics of an LTE stick (Huawei E3531). I found that a lot of values are accessible through an API in the form of XML files. These can be found at the following URLs (192.168.8.1 is the IP of the network interface provided by the LTE stick):

http://192.168.8.1/api/monitoring/month_statistics
http://192.168.8.1/api/monitoring/traffic-statistics
http://192.168.8.1/api/monitoring/status
http://192.168.8.1/api/device/basic_information
http://192.168.8.1/api/online-update/configuration
http://192.168.8.1/api/monitoring/converged-status
http://192.168.8.1/api/pin/status
http://192.168.8.1/api/monitoring/start_date

The month_statistics page looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<CurrentMonthDownload>166033238</CurrentMonthDownload>
<CurrentMonthUpload>9679896</CurrentMonthUpload>
<MonthDuration>26391</MonthDuration>
<MonthLastClearTime>2019-12-30</MonthLastClearTime>
</response>

So I would assume that the CurrentMonthDownload value is this month's used volume. The actual web page shows 167.57 MB. I'm still not 100% sure how this is calculated, but it should be accurate enough for me.
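For what it's worth, adding the two counters from the response above and dividing by 1024 * 1024 gives 167.57, which matches the figure on the web page, so the page may simply be showing download plus upload. That is an inference from these numbers, not documented behaviour. A minimal sketch of reading the value in bash follows; `xml_value` and `month_used_mb` are names made up for this example, and the sed-based extraction only works because this response is flat with one element per line.

```shell
#!/usr/bin/env bash
# Sketch: read the monthly counters from the stick's XML API and convert to MB.

# Print the text of the first <tag>...</tag> element found on stdin.
# Only suitable for flat XML where each element sits on a single line.
xml_value() {
  sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p" | head -n 1
}

# Read a month_statistics response on stdin and print download + upload in MB.
# Assumption: 1 MB is taken as 1024 * 1024 bytes.
month_used_mb() {
  local xml down up
  xml=$(cat)
  down=$(printf '%s\n' "$xml" | xml_value CurrentMonthDownload)
  up=$(printf '%s\n' "$xml" | xml_value CurrentMonthUpload)
  # LC_ALL=C keeps the decimal point a "." regardless of the system locale.
  LC_ALL=C awk -v d="$down" -v u="$up" 'BEGIN { printf "%.2f\n", (d + u) / 1048576 }'
}

# Usage against the device (URL taken from the list above):
#   curl -s http://192.168.8.1/api/monitoring/month_statistics | month_used_mb
```

If the response ever becomes nested or loses its line breaks, an XML-aware tool would be the safer choice than sed.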

Upvotes: 0

Dmitry

Reputation: 1293

What you are doing is called web scraping; there is plenty of information and tooling on the Internet for it. Let me share my approach:

1. The first thing you need to do, as pointed out by Hugo, is to read the DOM from the website.

For that you definitely need a headless browser. The good thing is that many regular browsers support a headless mode too. For example, if you have Google Chrome installed, you can run it from the command line in headless mode to get the entire (dynamic) content of the rendered page, e.g. on Mac:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --dump-dom --disable-gpu "https://www.google.com"

2. Once you get the rendered html content you need to process it reliably.

Parsing HTML (or any nested data format) with line-oriented tools like sed or awk is error-prone, so you need an HTML-aware utility that can parse the HTML and extract data from it.

I use jtm (developed by me), a lossless converter from HTML/XML to JSON and back. The reason to convert HTML into JSON is simple: JSON is among the most widely used data formats, and there are plenty of JSON parsers available these days.

Once the page is converted to JSON, you may use any tool to extract the required info. Again, there are plenty of offline tools available, but I use my own, jtc. It is very fast (though that only matters for really huge JSON documents), and it makes extracting information from JSON easy.

E.g., the following extracts the front-page list of questions from Stack Overflow:

bash $ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --dump-dom --disable-gpu "https://stackoverflow.com/questions" 2>/dev/null | jtm | jtc -w'[class]:<question-hyperlink>:[-3]><P:'
"Component LoginContainerComponent is not part of any NgModule or the module has not been imported into your module"
"Removing an object with attributes from a list?"
"Is there a Predicate for operator instanceof?"
"How to use multiple document processor in vespa.ai in separate search chain?"
"How do I load an external JS library in Svelte/Sapper?"
"Why do gcc and clang place custom-sectioned const-funcptr symbols into writable sections when compiling with -fpic?"
"How does one solve “ 'CMySQLCursor' object has no attribute 'keys'”?"
"css video iframe good practice wordpress"
"auto complete address and navigate to that place"
"How do I trace an exact or find a specific value in a matplotlib graph?"
"Swift 5 How to get JSON multilayer data to append?"
"Excel wrong data format"
"Converting React function component to class component in React-JS"
"Can't Move PTZ Camera using ONVIF Protocol -Python Client"
"Can a TypeScript HttpClient accept a string that is not explicitly formatted as JSON?"
bash $

Once the required info is extracted, you can incorporate the command line you built into your bash script.
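For a Linux box like the one in the question, step 1 could be wrapped in a small function along these lines. This is only a sketch: the binary names it tries are assumptions about what may be installed, `find_browser` and `dump_dom` are helper names made up for this example, and the flags are the ones from the command above. It will not help if the installed Chromium build itself crashes in headless mode.

```shell
#!/usr/bin/env bash
# Sketch: dump the rendered DOM of a page with whichever Chromium-family
# browser is installed. The candidate binary names below are assumptions.

# Print the name of the first candidate browser found in PATH.
find_browser() {
  local candidate
  for candidate in chromium-browser chromium google-chrome; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Print the rendered DOM of the URL given as $1 on stdout.
dump_dom() {
  local browser
  browser=$(find_browser) || { echo "no headless-capable browser found" >&2; return 1; }
  # Browser diagnostics go to stderr; drop them so only the DOM is piped on.
  "$browser" --headless --disable-gpu --dump-dom "$1" 2>/dev/null
}

# Usage: pipe the result into the HTML-aware tool of your choice, e.g.
#   dump_dom "https://stackoverflow.com/questions" | jtm | jtc -w'...'
```

Keeping the browser lookup separate from the dump makes it easy to add another binary name without touching the rest of the script.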

Upvotes: 1

Hugo Silva

Reputation: 6958

You could try to figure out what the dw function does and replicate it in your script. For example, if that function fetches some JSON containing the stats you want and prints it into the HTML, you could try to fetch that JSON directly with cURL.
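One way to start that investigation from the shell is to list the script files the page loads and then search them for the function or for request URLs. The sketch below assumes the scripts are referenced with a double-quoted `src` attribute on the same line as the `<script` tag; `list_script_sources` is a name made up for this example, and the line-oriented matching is only meant for discovery, not for robust parsing.

```shell
#!/usr/bin/env bash
# Sketch: list the JavaScript files referenced by an HTML page read from stdin.
# Assumes each <script ... src="..."> tag fits on one line and uses double quotes.
list_script_sources() {
  grep -o '<script[^>]*src="[^"]*"' | sed 's/.*src="\([^"]*\)".*/\1/'
}

# Usage (page URL taken from the question):
#   curl -s http://192.168.8.1/html/statistic.html | list_script_sources
# Then fetch each listed file with curl and search it, for example with
#   grep -n 'function dw'
# to find the definition, or grep -n 'api/' to spot data URLs.
```

If one of the scripts turns out to request a data URL, that URL can be fetched with curl directly and the browser step is no longer needed.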

But if you really need to read the result of executing a web page, there is no way around it: you need a browser. Probably, as you already pointed out, a "headless" browser such as PhantomJS. Here is a list of alternatives: https://github.com/dhamaniasad/HeadlessBrowsers

Upvotes: 1
