Owen Monagan

Reputation: 109

Cloud Dataflow: Generating tables in BigQuery

I have a pipeline that reads streaming data from Cloud Pub/Sub. The data is processed by Dataflow and then saved into one large BigQuery table; each Pub/Sub message includes an associated account_id. Is there a way to create new tables on the fly when a new account_id is identified, and then populate them with the data for that account_id?

I know that this can be done by updating the pipeline for each new account. But in an ideal world, Cloud Dataflow would generate these tables programmatically from within the pipeline code.

Upvotes: 3

Views: 821

Answers (2)

Mikhail Berlyant

Reputation: 173046

I wanted to share a few options I see:

Option 1 – wait for the "partition on non-date field" feature.
It is not known when this will be implemented and made available, so it might not be what you want right now. But when it goes live, it will be the best option for such scenarios.

Option 2 – hash your account_id into a predefined number of buckets. In this case you can pre-create all of those tables, and in your code have logic that picks the respective destination table based on the account hash (see the sketch below). The same hashing logic then needs to be used in the queries that read that data.
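A minimal sketch of the bucketing idea in Python, assuming 16 buckets and pre-created tables named events_00 through events_15 (the bucket count, dataset, and table names are hypothetical):

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; fix the count up front so tables can be pre-created

def bucket_for(account_id: str) -> int:
    # Use a stable hash (not Python's built-in hash(), which is randomized
    # per process) so writers and readers always agree on the bucket.
    digest = hashlib.md5(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def table_for(account_id: str) -> str:
    # e.g. "my_dataset.events_07" for whatever bucket the id lands in
    return "my_dataset.events_{:02d}".format(bucket_for(account_id))
```

Queries over this data would then apply the same function to work out which bucket table(s) to read.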

Upvotes: 1

Mingyu Zhong

Reputation: 531

The API for creating BigQuery Tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.
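For illustration, here is one way to call that API through the google-cloud-bigquery Python client; the project, dataset, per-account naming scheme, and schema are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

def ensure_account_table(account_id: str) -> bigquery.Table:
    # Hypothetical scheme: one table per account in my_dataset.
    table_id = "my_project.my_dataset.events_{}".format(account_id)
    schema = [
        bigquery.SchemaField("account_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ]
    table = bigquery.Table(table_id, schema=schema)
    # exists_ok=True makes this safe to call for every incoming message.
    return client.create_table(table, exists_ok=True)
```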

Nevertheless, it would probably be easier to store all accounts in one static table that has account_id as a column.

Upvotes: 1
