Reputation: 59
I am currently using BigQuery to store large time-series datasets from a variety of systems, with sample rates ranging from 1 sample per hour to 1,000 samples per second.
I apply a non-overlapping sliding window over each of the base tables to aggregate the data, which essentially down-samples it. I do this repeatedly over each aggregated table until the number of rows in the final table does not exceed 6000 rows. The reason I do this is so that I can view the time-series data at a number of different levels of detail.
Currently I load the data into BigQuery via a CSV upload using the Python SDK; the CSV files are what the measurement systems output and I have no control over their format. I carry out the table aggregations using the following code:
.
.
.
# Build a list of the field names in the table to be aggregated
field_names = [schema_field.name for schema_field in bq_client.get_table(f'{dataset}.{table}').schema]
# Build the MIN/MAX select clauses for every field except the index column
min_max_string = [f'MIN({field}) {field}_min, MAX({field}) {field}_max,' for field in field_names if field != index_col]
# Create a table for each additional level of detail beyond the base table (num_lvls is predetermined)
if num_lvls > 0:
    for lvl in range(1, num_lvls + 1):
        lod_dataset_name = f'{dataset}'
        lod_table_name = f'{table}_lvl_{lvl}'
        # Query string to aggregate the base table for this level of detail
        query = f'''CREATE OR REPLACE TABLE
        {lod_dataset_name}.{lod_table_name}
        AS
        WITH RankedData AS (
          SELECT ROW_NUMBER() OVER (ORDER BY {index_col}) Rank, *
          FROM `{dataset}.{table}`
        )
        SELECT DIV((Rank - 1), {window_length**lvl}) GroupId, MIN({index_col}) {index_col}, {' '.join(min_max_string)} MIN(Rank) index
        FROM RankedData
        GROUP BY GroupId
        ORDER BY GroupId'''
        # Create query job via API request to Google
        query_job = bq_client.query(query)
.
.
.
Notes on variables in the code above: num_lvls is the number of aggregated tables that need to be created, determined from the window size and the maximum number of rows allowed in the most aggregated table. index_col is the column the data must be ordered by; in 99% of cases this is the timestamp. GroupId is used to group the data into 'bins', which we then use to down-sample by taking the max and min per bin.
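Roughly, num_lvls is derived along these lines (a simplified sketch rather than the exact code; max_rows is the 6000-row cap mentioned above and the other names match the code):

import math

# Sketch: the smallest number of levels such that the most aggregated table
# holds at most max_rows rows.
max_rows = 6000
base_rows = bq_client.get_table(f'{dataset}.{table}').num_rows
if base_rows <= max_rows:
    num_lvls = 0
else:
    num_lvls = math.ceil(math.log(base_rows / max_rows, window_length))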
Although this seems to work on the current data, my concern is that if the datasets grow much larger I will run into issues with the ORDER BY clauses. As I understand it, BigQuery has no notion of ordered data, because the data is split across various storage resources. So when the data needs to be sorted, it all has to be loaded onto a single VM and sorted there, which I imagine will cause memory issues fairly soon.
I need the data sorted because the down-sampling requires it to be ordered by timestamp; each timestamp is unique and the delta between timestamps is constant. I then push the data to a React front-end, which plots the time series using D3. I only select a part of each table, depending on the user's selected level of detail, so the data needs to be ordered before it is sent to the front-end.
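For illustration, the per-request query is roughly the following (a sketch only; lvl, start_ts and end_ts come from the user's current selection and are not shown in the code above):

from google.cloud import bigquery

# Sketch: fetch one time range of a pre-aggregated level table, ordered by
# timestamp, for the window the user is currently viewing.
# start_ts / end_ts: datetime objects for the user's selected range (defined elsewhere)
slice_query = f'''
SELECT *
FROM `{dataset}.{table}_lvl_{lvl}`
WHERE {index_col} BETWEEN @start_ts AND @end_ts
ORDER BY {index_col}'''
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter('start_ts', 'TIMESTAMP', start_ts),
        bigquery.ScalarQueryParameter('end_ts', 'TIMESTAMP', end_ts),
    ]
)
rows = bq_client.query(slice_query, job_config=job_config).result()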
The issue with using partitions and clustering is that for the data sampled hourly, I will have only 24 rows per partition (one sample per hour, partitioned by day). With a maximum of 4,000 partitions per table, I will exceed that limit after roughly 10 years of data. Unfortunately, some of the datasets already go back 7 or 8 years, so I will be up against that limit fairly soon. For the more densely sampled data I think partitioning and clustering are the way to go.
One workaround I can think of, but have yet to test, is to create a single ingestion-time partition for the hourly data and then use clustering on that single partition, since clustering currently requires a table to be partitioned. As I understand the documentation, this should give me a sorted table.
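In Python SDK terms, the workaround would look something like this (a sketch only; project_id, schema and the _hourly table name are placeholders, and bq_client is the client used above):

from google.cloud import bigquery

# Sketch: ingestion-time partitioning (a one-off bulk load lands in a single
# daily partition) with clustering on the timestamp column, so rows are kept
# sorted by that column within the partition.
hourly_table = bigquery.Table(f'{project_id}.{dataset}.{table}_hourly', schema=schema)
hourly_table.time_partitioning = bigquery.TimePartitioning()  # ingestion-time, daily by default
hourly_table.clustering_fields = [index_col]                  # cluster on the timestamp column
bq_client.create_table(hourly_table)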
Would this solution work? Or is there perhaps a more elegant solution?
Any pointers or references I may have missed would be greatly appreciated. If there is anything not clear please let me know and I can update the question accordingly.
Upvotes: 0
Views: 455
Reputation: 416
Your concern is understandable; see the answer to this StackOverflow question [1]. If I have understood correctly, you are using BigQuery as a database to store data rather than to run analytics functions over it.
Therefore, I recommend using another kind of database such as Firestore [2]. That way you can sort the values in a query without issues, since Firestore can store a timestamp value in each document, which you can use to order your data. Here is a link to sorting data in Firestore [3].
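For example, ordering by a timestamp field with the Firestore Python client looks roughly like this (a sketch; the 'samples' collection and 'timestamp' field names are placeholders):

from google.cloud import firestore

# Sketch: read documents ordered by their timestamp field, capped at 6000 rows.
db = firestore.Client()
docs = (db.collection('samples')
          .order_by('timestamp')
          .limit(6000)
          .stream())
for doc in docs:
    print(doc.to_dict())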
In addition, I don't know your use case very well, nor why you create such a large number of tables, when you could create only one big table with the timestamp value and collect the data in that single table.
[1] bigquery resources exceeded due to order by
[2] https://cloud.google.com/firestore/
[3] https://firebase.google.com/docs/firestore/query-data/order-limit-data
Upvotes: 0