Bartek

Reputation: 3

Optimal ETL process and platform

I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/

What I need to do in my perception is the following:

  1. Download data from the API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs. In particular, I have this link in mind: "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot; a fetch sketch follows this list.)
  2. Set up a bot to transform the data a little to tailor it to our needs. (Technology: Python)
  3. Upload the data to a database. (Technology: Google BigQuery or AWS)
  4. Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: probably Dash in Python)
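Here is a minimal sketch of what I have in mind for step 1, using the requests library (error handling beyond raise_for_status is left out):

import requests

# Pull the hourly per-sensor averages (the endpoint quoted above)
url = "https://data.sensor.community/static/v2/data.1h.json"
response = requests.get(url, timeout=60)
response.raise_for_status()
data = response.json()  # a list of per-sensor measurement records
print(f"Downloaded {len(data)} records")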

My questions are the following:

  1. Do you agree with my thought process, or would you change some element to make it more efficient?
  2. What do you think about running a Python script to transform the data? Is there a simpler approach?
  3. Which technology would you suggest for setting up the database?

Thank you for the comments! Best regards, Bartek

Upvotes: 0

Views: 62

Answers (1)

If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can create new queries there and get the results you want to analyze. I was checking the dataset "data.1h.json", and I would create a table in BigQuery using a schema like this one:

CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
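If you prefer to stay in Python, this DDL can be submitted through the BigQuery client as an ordinary query job. A sketch, where the file name create_pollution_table.sql is just a placeholder for the statement above:

from google.cloud import bigquery

client = bigquery.Client()
# DDL statements run through the same query API as SELECTs
ddl = open("create_pollution_table.sql").read()  # the CREATE TABLE shown above
client.query(ddl).result()  # result() blocks until the table exists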

OK, now that the table is created, we need to insert all the data from the JSON file into it. Since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, transform it, and upload it to the BigQuery table.

The code would be something like this:

from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()

table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and specify the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')
table = client.get_table(table_id)

# Download the contents of the blob as a string and parse it with json.loads()
data = json.loads(blob.download_as_string())

# Partition the request in order to avoid reaching quotas
partition = len(data) // 4

cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Flush the final partial batch, which the loop above never inserts
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors == []:
        print("New rows have been added.")
    else:
        print(errors)

As you can see above, I had to partition the inserts in order to avoid reaching the quota on the maximum size of a request. You can see the relevant quotas here [3].

Also, some data in the location field seems to have empty values, so it is necessary to handle them to avoid errors.

And since your data is already stored in BigQuery, to create a dashboard I would use Data Studio [4] to visualize your BigQuery data and create queries over the columns you want to display.
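For example, a query like the sketch below unnests sensordatavalues to average one measurement type per country. Note that the value_type "P1" is an assumption based on the sensor.community field names, so check the actual values in your data:

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT location.country AS country, AVG(v.value) AS avg_value
    FROM dataset.pollution, UNNEST(sensordatavalues) AS v
    WHERE v.value_type = 'P1'  -- assumed field name, check your data
    GROUP BY country
"""
for row in client.query(query).result():
    print(row.country, row.avg_value)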

[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library

[2] https://cloud.google.com/storage

[3] https://cloud.google.com/bigquery/quotas

[4] https://cloud.google.com/bigquery/docs/visualize-data-studio

Upvotes: 1
