Skibumdon

Reputation: 3

Not able to download a web page that uses JavaScript

I have been trying to download a web page that I ultimately intend to scrape. The page uses JavaScript and includes a check for whether JavaScript is enabled; every download I get back reports that it is not enabled.

I am doing this under WSL2 (Ubuntu) on a Windows 10 machine. I have tried Selenium, headless Chrome, and axios, and I cannot figure out how to get any of them to execute the page's JavaScript.

I want to run this from my crontab, so I am not using any GUI.

The website is

https://app.aquahawkami.tech/nfc?imei=359986122021410

Before I can scrape the output, I first need a good download of the rendered page, and that is where I am stuck.

Here is the javascript:

// index.js

const axios = require('axios');
const fs = require('fs');
axios.get('https://app.aquahawkami.tech/nfc?imei=359986122021410', {responseType: 'document'}).then(response => {
  fs.writeFile('./wm.html', response.data, (err) => {
        if (err) throw err;
        console.log('The file has been saved!');
    });
});

Selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

driver.get("https://app.aquahawkami.tech/nfc?imei=359986122021410")

page_source = driver.page_source
print(page_source)
with open("aquahawk_source.html", "w", encoding="utf-8") as file_to_write:
    file_to_write.write(page_source)
driver.quit()  # quit() also shuts down the chromedriver process; close() only closes the window

Finally, headless Chrome:

`google-chrome --headless --disable-gpu --dump-dom https://app.aquahawkami.tech/nfc?imei=359986122021410`

Upvotes: 0

Views: 99

Answers (1)

GTK

Reputation: 1906

You don't need to render the page: here's an example of how you can get the data directly from the API every 6 hours:

async function getMeterData(imei){
  /* 
  this is a template string, it allows constructing/joining strings cleanly
  in this case it will insert the function argument `imei` into the string
  eg:`https://api.aquahawkami.tech/endpoint?imei=${imei}` --> https://api.aquahawkami.tech/endpoint?imei=359986122021410
  
  more info : 
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals
  */
  const url = `https://api.aquahawkami.tech/endpoint?imei=${imei}`;
   
  /* 
  this makes a fetch request (async) to the url.
  r.json() is called after the request (promise) fulfills, and .json() itself returns a promise
  `await` waits until said promise fulfills (now `data` contains the json object)
  
  more info: 
    https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise
  */
  const data = await fetch(url).then(r => r.json());

  /*
  console.log() just prints a message to the console, something like python's print()
  console.log(data) will print the data (object) to the console.

  more info:
    https://developer.mozilla.org/en-US/docs/Web/API/console/log_static
  */
  console.log(data);

  /*
  this is object destructuring, it is equivalent to this:
  const slp_time = data.attributes.slp_time;
  const reading = data.attributes.reading;
  const lastUpdateTime = data.lastUpdateTime;
  
  more info:
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Destructuring_assignment
  */
  const {attributes: {slp_time, reading}, lastUpdateTime} = data;

  /*
  here I convert `slp_time` from a string to an int (parseInt)
  then multiply it by the length of the `reading` array 
  then multiply by 1000 to convert from seconds to milliseconds (Date and setTimeout use milliseconds)

  this is simply trying to "dynamically" calculate the 6 hour interval
  if you wish you can replace this line entirely with a hardcoded value
  eg: const interval_ms = 21600000;
  or: const interval_ms = 6 * 60 * 60 * 1000;
  */
  const interval_ms = parseInt(slp_time, 10) * reading.length * 1000;
  
  /*
  `new Date(lastUpdateTime)` will create a new Date object from the string `lastUpdateTime` (data.lastUpdateTime)
  `.getTime()` will return that date as timestamp (in milliseconds)
  by adding `interval_ms` to that last update timestamp we should get the next update timestamp

  more info:
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date
  */
  const nextUpdateTime = new Date(lastUpdateTime).getTime() + interval_ms;

  /*
  `new Date().getTime()` will return the CURRENT date and time as a timestamp
  `nextUpdateTime - new Date().getTime()` calculates the difference between the current time and nextUpdateTime
   now we know how long we have to wait from NOW until the next update
  */
  const wait = nextUpdateTime - new Date().getTime();

  /*
  `setTimeout()` sets a timer and calls the supplied function when the timer runs out,
  it takes a function (to call) and a timeout in milliseconds
  in this case it will call this arrow function: () => getMeterData(imei) after `wait` runs out.
  so basically `getMeterData` creates a timer to call itself after 6 hours and does that indefinitely

  more info:
    https://developer.mozilla.org/en-US/docs/Web/API/setTimeout
    https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Functions/Arrow_functions
  */
  setTimeout(() => getMeterData(imei), wait);
}

// here we make the first call to the `getMeterData` function with '359986122021410' as the imei argument
getMeterData('359986122021410');
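
The wait calculation above can also be pulled out into a small pure function, which makes it easier to check by hand. The following is only a sketch: `msUntilNextUpdate` is a hypothetical helper name, the sample object contains made-up values, and it assumes the response has the same `attributes.slp_time`, `attributes.reading`, and `lastUpdateTime` fields used above. It also falls back to a one-minute delay when the computed wait is negative or not a number, since `setTimeout` would otherwise fire almost immediately and the function could end up polling in a tight loop.

```javascript
// Hypothetical helper: how many milliseconds to wait before the next poll.
// Assumes `data` has the fields used in getMeterData above.
function msUntilNextUpdate(data, nowMs = Date.now()) {
  const { attributes: { slp_time, reading }, lastUpdateTime } = data;

  // slp_time is a string of seconds; one reading is taken per sleep period.
  const intervalMs = parseInt(slp_time, 10) * reading.length * 1000;
  const nextUpdateMs = new Date(lastUpdateTime).getTime() + intervalMs;
  const wait = nextUpdateMs - nowMs;

  // Guard against stale or malformed data: never return a negative or NaN delay.
  return Number.isFinite(wait) && wait > 0 ? wait : 60 * 1000;
}

// Made-up sample values, for illustration only.
const sample = {
  attributes: { slp_time: '3600', reading: [1, 2, 3, 4, 5, 6] },
  lastUpdateTime: '2024-01-01T00:00:00Z',
};
const now = Date.parse('2024-01-01T01:00:00Z');

console.log(msUntilNextUpdate(sample, now)); // 18000000 (5 hours)
```

Inside `getMeterData`, the last line would then become `setTimeout(() => getMeterData(imei), msUntilNextUpdate(data));`.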

You might prefer to use a scheduler/cronjob instead of setTimeout.
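
Since the question mentions crontab, here is a rough sketch of a one-shot version with no timer: it fetches once, appends the result to a file, and exits. It assumes Node.js 18 or later (for the built-in `fetch`). The file name `fetch-meter.js`, the helper `toLogLine`, and the output file `meter.jsonl` are names I made up for illustration.

```javascript
// fetch-meter.js: one-shot variant meant to be run by cron.
// Assumes Node.js 18+ (global fetch) and the same /endpoint URL as above.
const fs = require('fs');

// Builds one line of output (JSON Lines format).
// Kept separate so it can be checked without network access.
function toLogLine(data, fetchedAt = new Date()) {
  return JSON.stringify({ fetchedAt: fetchedAt.toISOString(), data }) + '\n';
}

async function main(imei, outFile) {
  const url = `https://api.aquahawkami.tech/endpoint?imei=${encodeURIComponent(imei)}`;
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} from ${url}`);
  }
  fs.appendFileSync(outFile, toLogLine(await response.json()));
}

// Usage: node fetch-meter.js <imei> [output file]
// Nothing is fetched unless an IMEI is passed on the command line.
const [imei, outFile = './meter.jsonl'] = process.argv.slice(2);
if (imei) {
  main(imei, outFile).catch((err) => {
    console.error(err);
    process.exitCode = 1; // non-zero exit so cron can report the failure
  });
}
```

A crontab entry along the lines of `0 */6 * * * /usr/bin/node /path/to/fetch-meter.js 359986122021410 /path/to/meter.jsonl` would run it every 6 hours; both paths are placeholders to adjust for your system.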

There is also this endpoint, which returns slightly different data: https://api.aquahawkami.tech/meter?meter=83837540

The difference between the two addresses is that the second one, /meter, includes `reads` and `reading` arrays in slightly different formats (string vs. int and timestamp). It also seems to include all of the data from the first address, /endpoint. The actual reading values are the same across both arrays and both addresses, so you can use whichever one is more convenient.

Upvotes: 0
