Reputation: 1267
I am building a Spark job to migrate 6 years of e-commerce orders/events into our new BigQuery warehouse to complement the new streaming capability. The migrated data is in the same model as the new data, and views will expose parts as needed.
We have 4 datasets:
- data (raw data - hierarchical JSON)
- views (like DIMs over the raw data - generally flat, but not always)
- pre_aggs (e.g. currency-converting raw totals with some key rollup dims included - flat)
- reports (as you would think)
Unlike some of the new live data, which is streamed in and unbounded, the migration of this historical data is batch and bounded (I don't have to concern myself with late-arriving events, watermarking, or duplicates, for example). I can also partition the data manually by event time (orderPlacedTimestamp) and persist it in the correct date-partitioned table (suffix?). The full data is about 2 GB and 2 million rows compressed in BQ, so not massive, but the structures are fairly complex and need unnesting in the view layer. I have the option to write the raw and aggregate data as materialised tables from Spark, so I really want to do this in line with best practice and optimise for performance (query speed is more important and worth paying a bit extra for).
I found this really good blog on the SQL/Query lambda architecture, which is some inspiration, and I will attempt to do something similar.
https://www.linkedin.com/pulse/building-robust-real-time-etl-google-bigquery-lambda-novozhilov
I'm still wondering, though, how best to store/partition this data and then construct time-based queries to match. Week-on-week and month-on-month reports are likely to be the most common.
My options seem to be:
Everything in one table - seems nice and simple with no ongoing table management over time, but it means a full scan for every query, when often I just want to go back a year or two at most from a point in time.
One table per time segment, e.g. yearly or monthly:
order_history_2017 or order_history_201701
We have a calendar lookup dim, and each row has keys to which the suffix above could map - e.g. 201701 for Jan 2017.
Monthly would mean 72 tables, which seems quite a lot; maybe yearly is better?
For argument's sake, say it's monthly tables. What is the best way in BQ (standard SQL) to then query the right prefix of tables containing a contiguous timeline, constructing the table name on the fly with the right suffix (dynamically, perhaps)?
e.g. say I want to query all orders (an order has an orderPlacedTimestamp) between 2017-01-10 and 2017-02-10 - this would mean scanning (and unioning?) only the order_history_201701 and order_history_201702 tables in this case, then doing a BETWEEN like below:
SELECT *
FROM (
  SELECT * FROM order_history_201701
  UNION ALL
  SELECT * FROM order_history_201702
)
WHERE DATE(`order`.orderPlacedTimestamp) BETWEEN DATE("2017-01-10") AND DATE("2017-02-10")
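Or perhaps wildcard tables with _TABLE_SUFFIX would avoid hand-writing the UNION and limit the scan to the matching shards - a rough sketch of what I mean (the dataset name mydataset is just a placeholder):
#standardSQL
-- Wildcard-table sketch: the _TABLE_SUFFIX filter restricts the query to the
-- 201701 and 201702 shards only; `mydataset` is an assumed dataset name.
SELECT *
FROM `mydataset.order_history_*`
WHERE _TABLE_SUFFIX BETWEEN '201701' AND '201702'
  AND DATE(`order`.orderPlacedTimestamp) BETWEEN DATE("2017-01-10") AND DATE("2017-02-10")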
I might then get scenarios where this historical data needs unioning with the 'live' (streaming) data too - wrapped up in a view like in the article on the lambda design.
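Something like this is what I have in mind for that view - just a sketch, with the live table name (orders_live) and the matching schemas being assumptions on my part:
#standardSQL
-- Lambda-style view sketch: historical shards unioned with the live streaming
-- table; `orders_live` and identical schemas across the sources are assumptions.
CREATE OR REPLACE VIEW `mydataset.orders_all` AS
SELECT * FROM `mydataset.order_history_*`
UNION ALL
SELECT * FROM `mydataset.orders_live`;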
So many options with BQ! :)
Anyway, that is my current thinking; any words of wisdom on this topic would be hugely appreciated in relation to table design and optimised query construction.
Thanks heaps all!
Upvotes: 3
Views: 721
Reputation: 173046
My recommendation is to seriously consider BigQuery's native Partitioned Tables functionality.
While streaming to partitioned tables has some limitations (you can stream to partitions within the last 30 days in the past and 5 days in the future relative to the current date), there are no such limitations for load or query jobs.
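For your week-on-week / month-on-month queries this means you filter on the _PARTITIONTIME pseudo column and only the matching partitions get scanned. A quick sketch (the table name here is just an assumption for your case):
#standardSQL
-- Partition-pruning sketch: the filter on _PARTITIONTIME limits the scan to the
-- partitions in the range; `mydataset.order_history` is an assumed table name.
SELECT *
FROM `mydataset.order_history`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2017-01-10") AND TIMESTAMP("2017-02-10")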
Quite a while ago I considered using this option as a workaround for partitioning by a column other than date, by mapping attribute values to dates between '0001-01-01' and '9999-12-31' (3,652,058 days - thus that many distinct attribute values to partition by). The POC was conceptually successful, but I still didn't like it, especially given the strong promise (at least that's how I felt at the time) from the Google team to introduce partitioning by a date or int column of the table. I decided to wait for this.
Meantime, to refresh and double-check that loading into or querying a partitioned table still has no partition-specific limitations (as it somehow sounded in one SO post), I did a quick test and you can see the result below.
Step 1 - Create partitioned Table - project.dataset.partitioned_table
I just used the UI for this.
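If you prefer SQL over the UI, DDL can do the same - a sketch, where the single column a just matches the test query below:
#standardSQL
-- DDL alternative to the UI: creates an ingestion-time, day-partitioned table.
CREATE TABLE `project.dataset.partitioned_table` (a INT64)
PARTITION BY _PARTITIONDATE;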
Step 2 - Insert query result into table's different partitions
#standardSQL
SELECT 13 AS a
with project.dataset.partitioned_table$YYYYMMDD as the destination table (you can use DML's INSERT for this)
I ran this a few times for a range of dates between 0001-01-01 and 9999-12-31.
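The DML version of that step would look roughly like this (a sketch; the _PARTITIONTIME value must be a day boundary):
#standardSQL
-- DML alternative to the $YYYYMMDD destination decorator: writes a row directly
-- into the 2017-07-16 partition via the _PARTITIONTIME pseudo column.
INSERT INTO `project.dataset.partitioned_table` (_PARTITIONTIME, a)
VALUES (TIMESTAMP("2017-07-16"), 13);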
Step 3 - Check result
#standardSQL
SELECT DATE(_partitiontime) AS partition_, a
FROM `project.dataset.partitioned_table`
ORDER BY a
The result was (note that the format of partition_ here is YYYY-MM-DD):
partition_ a
---------- --
2017-07-16 1
2017-07-16 2
2017-07-16 3
2017-07-16 4
2017-07-15 5
2017-07-14 6
2010-01-01 7
2001-01-01 8
1001-01-01 9
0001-01-01 10
4001-01-01 11
7001-01-01 12
9999-12-31 13
Upvotes: 1