Partitioning by date?

Question

We are experimenting with BigQuery to analyze user data generated by our software application.

Our working table consists of hundreds of millions of rows, each representing a unique user "session". Each row contains a timestamp, a UUID, and other fields describing the user's interaction with our product during that session. We currently generate about 2GB of data (~10M rows) per day.

Every so often we may run queries against the entire dataset (about 2 months' worth right now, and growing). However, typical queries will span just a single day, week, or month. We're finding that as our table grows, our single-day query becomes more and more expensive (as we would expect given BigQuery's architecture).

What is the best way to query subsets of our data more efficiently? One approach I can think of is to "partition" the data into separate tables by day (or week, month, etc.) and then query them together in a union:

SELECT foo from mytable_2012-09-01, mytable_2012-09-02, mytable_2012-09-03;

Is there a better way than this?

Michael Manoochehri · Accepted Answer

Hi David: The best way to handle this is to shard your data across many tables and run queries as you suggest in your example.

To be clear, BigQuery does not have a concept of indexes (by design), so a query scans every row of the tables it names, however narrow its date filter. Sharding data into separate tables is therefore a useful strategy for keeping queries as economical as possible.
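As a rough sketch of how the sharded query could be assembled for an arbitrary date range, the snippet below builds the list of daily table names and joins them into a query of the same shape as the example in the question. The helper names `daily_tables` and `union_query` are hypothetical, and the `mytable_` prefix and date format simply follow the question's naming; adjust both to whatever your tables are actually called.

```python
from datetime import date, timedelta


def daily_tables(prefix, start, end, fmt="%Y-%m-%d"):
    """Return one table name per day from start to end, inclusive."""
    days = (end - start).days
    return [
        prefix + (start + timedelta(days=i)).strftime(fmt)
        for i in range(days + 1)
    ]


def union_query(columns, tables):
    """Build a query over comma-separated tables, as in the question."""
    return "SELECT {} FROM {};".format(columns, ", ".join(tables))


# Three daily shards, matching the example in the question.
tables = daily_tables("mytable_", date(2012, 9, 1), date(2012, 9, 3))
query = union_query("foo", tables)
print(query)
```

A single-day query then names only one shard, so its cost should track that day's data rather than the whole history.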

On the flip side, if you are worried about ending up with too many tables, you can set an expirationTime on each table. Once that time passes, the table is deleted and its storage reclaimed; without it, tables persist indefinitely.
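A minimal sketch of computing such an expiration for a daily shard follows. It assumes expirationTime is given in milliseconds since the Unix epoch, and that you want each shard to live for a fixed number of days after the date it covers. The helper name `expiration_ms` and the 90-day retention are illustrative assumptions, not part of any BigQuery API; how the value is sent to BigQuery is left out.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expiration_ms(shard_day, keep_days):
    """Milliseconds since the Unix epoch at which a shard should expire.

    The shard is kept for keep_days days after midnight UTC of shard_day.
    """
    midnight = datetime(
        shard_day.year, shard_day.month, shard_day.day, tzinfo=timezone.utc
    )
    expires = midnight + timedelta(days=keep_days)
    # Integer division of timedeltas avoids floating-point rounding.
    return (expires - EPOCH) // timedelta(milliseconds=1)


# Assumed retention policy: keep each daily table for 90 days.
table_settings = {"expirationTime": expiration_ms(date(2012, 9, 1), 90)}
print(table_settings)
```

Tying the expiration to the shard's own date, rather than to the moment the table is created, keeps retention consistent even when old days are backfilled later.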
