chews

Reputation: 2659

data aggregation and averaging on 200 billion records

The records started out in Avro files that were created daily with the following schema. There are 20 different attribute types stored in the "attribute_key" and "attribute_value" fields; a timestamp and a device_id are also included in each measurement.

"fields" : [
{"type":"string", "name":"device_id"},
{"type":"string", "name":"record_date"},
{"type":"string", "name":"attribute_key"},
{"type":"string", "name":"attribute_value"}]

I have been able to take the daily files and load them into month-separated tables in BigQuery.

device_attributes201501
device_attributes201502
device_attributes201503
device_attributes201504
device_attributes201505
device_attributes201506
device_attributes201507
device_attributes201508
device_attributes201509
device_attributes201510
device_attributes201511
device_attributes201512
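
Before either aggregation, the twelve monthly tables need to be queried as one. A sketch in BigQuery's legacy SQL, assuming the tables live in a dataset I'll call mydataset (in legacy SQL a comma between table names means UNION ALL, and TABLE_QUERY selects tables by a predicate on their names):

```sql
-- Explicit union of two of the monthly tables (comma = UNION ALL in legacy SQL)
SELECT device_id, record_date, attribute_key, attribute_value
FROM
  [mydataset.device_attributes201501],
  [mydataset.device_attributes201502]

-- Or match all monthly tables at once by name
SELECT device_id, record_date, attribute_key, attribute_value
FROM TABLE_QUERY(mydataset, 'table_id CONTAINS "device_attributes"')
```

Either form can feed the deduplication and averaging queries, and the result can be written to a destination table so downstream queries don't rescan all 200 billion rows.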

My problem is two-fold,

I need to create a table that consists of all the unique device_ids collected across all time, with the latest attribute value for each attribute type.

   device_id, record_date, attribute_key, attribute_value
   abc123     2015-10-11   attribute_1    5
   abc123     2015-11-11   attribute_1    5
   abc123     2015-12-11   attribute_1    10
   abc123     2015-10-11   attribute_1    0
   abc456     2015-10-11   attribute_1    0
   abc789     2015-10-11   attribute_1    0
   abc123     2015-11-11   attribute_1    0
   abc456     2015-11-11   attribute_1    0
   abc789     2015-11-11   attribute_1    6
   abc123     2015-10-11   attribute_2    blue
   abc123     2015-11-11   attribute_2    red
   abc123     2015-12-11   attribute_2    red
   abc456     2015-12-11   attribute_2    blue
   abc789     2015-12-11   attribute_2    green
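
The "latest value per (device_id, attribute_key)" part is a classic top-1-per-group query. Below is a minimal local sketch using sqlite3 as a stand-in for BigQuery, with a few of the sample rows from above; in BigQuery the same idea can be expressed with ROW_NUMBER() OVER (PARTITION BY device_id, attribute_key ORDER BY record_date DESC) and keeping the rows numbered 1.

```python
# Sketch: keep only the most recent row per (device_id, attribute_key).
# sqlite3 stands in for BigQuery here; table and column names are taken
# from the question's schema, and the rows are a subset of the sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE device_attributes (
    device_id TEXT, record_date TEXT,
    attribute_key TEXT, attribute_value TEXT)""")
conn.executemany("INSERT INTO device_attributes VALUES (?,?,?,?)", [
    ("abc123", "2015-10-11", "attribute_1", "5"),
    ("abc123", "2015-11-11", "attribute_1", "5"),
    ("abc123", "2015-12-11", "attribute_1", "10"),
    ("abc123", "2015-10-11", "attribute_2", "blue"),
    ("abc123", "2015-12-11", "attribute_2", "red"),
])

# Join each row against the max record_date of its group; ISO-8601 dates
# compare correctly as strings, so MAX() picks the latest date.
latest = conn.execute("""
    SELECT a.device_id, a.record_date, a.attribute_key, a.attribute_value
    FROM device_attributes a
    JOIN (SELECT device_id, attribute_key, MAX(record_date) AS max_date
          FROM device_attributes
          GROUP BY device_id, attribute_key) b
      ON a.device_id = b.device_id
     AND a.attribute_key = b.attribute_key
     AND a.record_date = b.max_date
    ORDER BY a.attribute_key
""").fetchall()
for row in latest:
    print(row)
```

Note that if two rows in a group share the same latest date (as some of the sample rows above do), the MAX-join keeps both; the ROW_NUMBER() form in BigQuery keeps exactly one per group.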

A weekly, monthly, and 90-day average would also need to be computed for some attributes (attribute_3 is the average of the samples collected).

   device_id, last_update, attribute_1, attribute_2
   abc123     2015-12-11   6            red
   abc456     2015-12-11   0            blue
   abc789     2015-12-11   3            green
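
The trailing averages are a filter-then-aggregate: restrict to rows whose record_date falls in the window ending at the reference date, then AVG per device. A local sketch with sqlite3 again standing in for BigQuery; the attribute_3 values below are made up for illustration, and in BigQuery the date arithmetic would use its own date functions rather than sqlite's date().

```python
# Sketch: 90-day trailing average of attribute_3 per device, as of a
# reference date.  attribute_3 sample values are invented for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE device_attributes (
    device_id TEXT, record_date TEXT,
    attribute_key TEXT, attribute_value TEXT)""")
conn.executemany("INSERT INTO device_attributes VALUES (?,?,?,?)", [
    ("abc123", "2015-09-01", "attribute_3", "100"),  # outside the 90-day window
    ("abc123", "2015-10-01", "attribute_3", "4"),    # inside
    ("abc123", "2015-12-01", "attribute_3", "8"),    # inside
    ("abc123", "2015-12-01", "attribute_1", "7"),    # wrong key, filtered out
])

# attribute_value is stored as a string, so cast before averaging.
# date('2015-12-11', '-90 days') evaluates to '2015-09-12'.
avg_90d = conn.execute("""
    SELECT device_id, AVG(CAST(attribute_value AS REAL)) AS avg_value
    FROM device_attributes
    WHERE attribute_key = 'attribute_3'
      AND record_date > date('2015-12-11', '-90 days')
      AND record_date <= '2015-12-11'
    GROUP BY device_id
""").fetchall()
print(avg_90d)
```

The weekly and monthly variants only change the window length; each result can be written to its own destination table and refreshed on a schedule.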

I'm curious how best to take this on, and I have no idea where to go from here. The data is in BigQuery now, and I have access to the full suite of Google Cloud tools, like Dataflow or anything else.

The data was originally in an S3 bucket, so I can process it with any of the solutions on AWS.

I just don't know what's the smartest way to go about doing it.

Upvotes: 0

Views: 258

Answers (1)

Alex Amato

Reputation: 1725

Hopefully some of these links will help you. Creating a table: https://cloud.google.com/bigquery/docs/tables#creating-a-table

BigQuery Web UI: https://cloud.google.com/bigquery/bigquery-web-ui

How to create a table from a query (blog post from a user). This one suggests that you can use the BigQuery Web UI and specify a destination table. I couldn't find that in the official docs, so I'm not sure whether it works; if not, you'll need to set up the API and write a bit of code as shown in their example. https://chartio.com/resources/tutorials/how-to-create-a-table-from-a-query-in-google-bigquery/

Upvotes: 1

Related Questions