Reputation: 59
I am currently using BigQuery to store large time-series datasets from a variety of systems, with sample rates ranging from 1 sample per hour to 1,000 samples per second.
I apply a non-overlapping sliding window over each of the base tables to aggregate the data, which essentially down-samples it. I do this repeatedly over each aggregated table until the number of rows in the final table does not exceed 6000 rows. The reason I do this is so that I can view the time-series data at a number of different levels of detail.
Currently I load the data into BigQuery via a CSV upload using the Python SDK; the CSV files are what the measurement systems output and I have no control over their format. I carry out the table aggregations using the following code:
.
.
.
# Build a list of the field names in the table to be aggregated
field_names = [schema_field.name for schema_field in bq_client.get_table(f'{dataset}.{table}').schema]
# Build the MIN/MAX select clauses for every field except the index column
min_max_string = [f'MIN({field}) {field}_min, MAX({field}) {field}_max,' for field in field_names if field != index_col]
# Create a table for each additional level of detail beyond the base table (num_lvls is predetermined)
if num_lvls > 0:
    for lvl in range(1, num_lvls + 1):
        lod_dataset_name = f'{dataset}'
        lod_table_name = f'{table}_lvl_{lvl}'
        # Query string to aggregate the base table for this level of detail
        query = f'''CREATE OR REPLACE TABLE
        {lod_dataset_name}.{lod_table_name}
        AS
        WITH RankedData AS (
          SELECT ROW_NUMBER() OVER (ORDER BY {index_col}) Rank, *
          FROM `{dataset}.{table}`
        )
        SELECT DIV((Rank - 1), {window_length**lvl}) GroupId, MIN({index_col}) {index_col}, {' '.join(min_max_string)} MIN(Rank) index
        FROM RankedData
        GROUP BY GroupId
        ORDER BY GroupId'''
        # Create query job via API request to Google
        query_job = bq_client.query(query)
.
.
.
Notes on variables in the code above: num_lvls is the number of aggregated tables that need to be created, determined from the window size and the maximum number of rows allowed in the most aggregated table. index_col is the column the data must be ordered by; in 99% of cases this is the timestamp. GroupId is used to group the data into 'bins', which we then use to down-sample by taking the max and min per bin.
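Roughly, num_lvls is derived along these lines (a simplified sketch rather than the exact code; max_rows is the 6000-row cap mentioned above and the other names match the code):

import math

# Sketch: the smallest number of levels such that the most aggregated table
# holds at most max_rows rows.
max_rows = 6000
base_rows = bq_client.get_table(f'{dataset}.{table}').num_rows
if base_rows <= max_rows:
    num_lvls = 0
else:
    num_lvls = math.ceil(math.log(base_rows / max_rows, window_length))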
Although this seems to work on the current data, my concern is that if the datasets grow much larger I will run into issues with the ORDER BY clauses. As I understand it, BigQuery has no notion of ordered data, because the data is split across various storage resources. So when the data needs to be sorted, it all has to be loaded onto a single VM and sorted there, which I imagine will cause memory issues fairly soon.
I need the data sorted because the down-sampling requires it to be ordered by timestamp; each timestamp is unique and the delta between timestamps is constant. I then push the data to a React front-end, which plots the time series using D3. I only select a part of each table, depending on the user's selected level of detail, so the data needs to be ordered before it is sent to the front-end.
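For illustration, the per-request query is roughly the following (a sketch only; lvl, start_ts and end_ts come from the user's current selection and are not shown in the code above):

from google.cloud import bigquery

# Sketch: fetch one time range of a pre-aggregated level table, ordered by
# timestamp, for the window the user is currently viewing.
# start_ts / end_ts: datetime objects for the user's selected range (defined elsewhere)
slice_query = f'''
SELECT *
FROM `{dataset}.{table}_lvl_{lvl}`
WHERE {index_col} BETWEEN @start_ts AND @end_ts
ORDER BY {index_col}'''
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter('start_ts', 'TIMESTAMP', start_ts),
        bigquery.ScalarQueryParameter('end_ts', 'TIMESTAMP', end_ts),
    ]
)
rows = bq_client.query(slice_query, job_config=job_config).result()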
The issue with using partitions and clustering is that for the data sampled hourly, I will have only 24 rows per partition (one sample per hour, partitioned by day). With a maximum of 4,000 partitions per table, I will exceed that limit after roughly 10 years of data. Unfortunately, some of the datasets already go back 7 or 8 years, so I will be up against that limit fairly soon. For the more densely sampled data I think partitioning and clustering are the way to go.
One workaround I can think of, but have yet to test, is to create a single ingestion-time partition for the hourly data and then use clustering on that single partition, since clustering currently requires a table to be partitioned. As I understand the documentation, this should give me a sorted table.
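In Python SDK terms, the workaround would look something like this (a sketch only; project_id, schema and the _hourly table name are placeholders, and bq_client is the client used above):

from google.cloud import bigquery

# Sketch: ingestion-time partitioning (a one-off bulk load lands in a single
# daily partition) with clustering on the timestamp column, so rows are kept
# sorted by that column within the partition.
hourly_table = bigquery.Table(f'{project_id}.{dataset}.{table}_hourly', schema=schema)
hourly_table.time_partitioning = bigquery.TimePartitioning()  # ingestion-time, daily by default
hourly_table.clustering_fields = [index_col]                  # cluster on the timestamp column
bq_client.create_table(hourly_table)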
Would this solution work? Or is there perhaps a more elegant solution?
Any pointers or references I may have missed would be greatly appreciated. If there is anything not clear please let me know and I can update the question accordingly.
Upvotes: 0
Views: 455
Reputation: 416
Your concern is understandable; see the answer to this StackOverflow question [1]. If I have understood correctly, you are using BigQuery as a database to store data rather than to run analytics functions over it.
Therefore, I recommend using another kind of database such as Firestore [2]. That way you can sort the values in a query without issues, since Firestore can store a timestamp value in each document, which you can use to order your data. Here is a link to sorting data in Firestore [3].
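For example, ordering by a timestamp field with the Firestore Python client looks roughly like this (a sketch; the 'samples' collection and 'timestamp' field names are placeholders):

from google.cloud import firestore

# Sketch: read documents ordered by their timestamp field, capped at 6000 rows.
db = firestore.Client()
docs = (db.collection('samples')
          .order_by('timestamp')
          .limit(6000)
          .stream())
for doc in docs:
    print(doc.to_dict())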
In addition, I don't know your use case very well, nor why you create such a large number of tables, when you could create only one big table with the timestamp value and collect the data in that single table.
[1] bigquery resources exceeded due to order by
[2] https://cloud.google.com/firestore/
[3] https://firebase.google.com/docs/firestore/query-data/order-limit-data
Upvotes: 0