Reputation: 3979
Our BigQuery schema is heavily nested/repeated and constantly changes. For example, adding a new page, form, or user-info field to the website means adding new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever because you can't delete columns in BigQuery.
So we will eventually end up with tables with hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would have just two columns, one for the timestamp and another for the JSON data). Batch jobs that we run every 10 minutes would then perform joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use a document store instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate raw data into tables that serve Tableau. The top answer to "BigQuery: Create column of JSON datatype" doesn't work that well for us because the data we get can be heavily nested with repeated fields.
Upvotes: 4
Views: 4705
Reputation: 158
You can write data to dynamic destinations (tables), where each table holds a separate schema version, e.g. "table_v1", "table_v2", etc. Apache Beam or another processing engine can be used for this. You can then query the data with a wildcard table: https://cloud.google.com/bigquery/docs/querying-wildcard-tables. "BigQuery uses the schema for the most recently created table that matches the wildcard as the schema for the wildcard table." This could do the job, but you must ensure that the table with the latest schema version is the one that was created last.
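A minimal sketch of what such a wildcard query could look like, built as a string in Python. The project, dataset, and `table_v` prefix are hypothetical names for illustration; `_TABLE_SUFFIX` is the pseudo-column BigQuery exposes on wildcard tables, and here it would carry the version number.

```python
def wildcard_query(project: str, dataset: str, prefix: str, min_version: int = 1) -> str:
    """Build a Standard SQL query over every table named <prefix><N>.

    All names passed in are placeholders; adapt them to your own dataset.
    """
    return (
        "SELECT *, _TABLE_SUFFIX AS schema_version\n"
        f"FROM `{project}.{dataset}.{prefix}*`\n"
        # SAFE_CAST returns NULL (filtered out) for suffixes that are not numbers.
        f"WHERE SAFE_CAST(_TABLE_SUFFIX AS INT64) >= {min_version}"
    )


if __name__ == "__main__":
    print(wildcard_query("my-project", "events", "table_v", min_version=2))
```

Columns that exist only in older versions would not appear in the result, since the wildcard table takes its schema from the most recently created match.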
Upvotes: 0
Reputation: 3389
I think this use case can be implemented using Dataflow (or Apache Beam) with its dynamic destinations feature. The steps of the Dataflow job would be like:
I have implemented this logic in my use case and it is working perfectly fine.
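As a rough illustration of the routing part only (not this answer's actual pipeline), the sketch below picks a destination table per element. In the Beam Python SDK, `beam.io.WriteToBigQuery` accepts a callable for its `table` argument that receives the element and returns a table spec, so a function like this could be passed there. The `schema_version` field and the `my-project:events.table_v` naming are assumptions made up for the example.

```python
from collections import defaultdict


def destination_table(element: dict) -> str:
    """Return a 'PROJECT:DATASET.TABLE' spec based on the element's schema version.

    'schema_version' is a hypothetical field; elements without it go to v1.
    """
    version = element.get("schema_version", 1)
    return f"my-project:events.table_v{version}"


def route(elements: list) -> dict:
    """Group elements by destination table, mimicking what the sink would do."""
    routed = defaultdict(list)
    for element in elements:
        routed[destination_table(element)].append(element)
    return dict(routed)


if __name__ == "__main__":
    events = [{"schema_version": 1, "page": "home"}, {"schema_version": 2, "form": "signup"}]
    for table, rows in route(events).items():
        print(table, len(rows))
```

In a real pipeline the sink would also need a schema per destination and a create disposition that allows new tables to be created.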
Upvotes: 1
Reputation: 207982
You are already well prepared; you lay out several options in your question.
You could go with the JSON table and, to keep costs low, instead of having just the two timestamp + JSON columns I would add one partitioning column and clustering columns as well (BigQuery allows up to four). Eventually you could even use yearly suffixed tables. This way you have up to six dimensions along which to limit the number of rows scanned for rematerialization.
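A hedged sketch of the table layout this describes, generated as DDL from Python. BigQuery limits a table to four clustering columns, so the example uses four; the column names (`event_ts`, `event_type`, `page`, `form_id`, `user_segment`, `payload`) are invented for illustration, and the payload is kept as a STRING holding JSON text.

```python
def json_events_ddl(
    table: str,
    cluster_cols: tuple = ("event_type", "page", "form_id", "user_segment"),
) -> str:
    """Build CREATE TABLE DDL for a partitioned, clustered JSON-payload table.

    Column names are placeholders; pick the fields your batch jobs filter on.
    """
    if not 1 <= len(cluster_cols) <= 4:
        raise ValueError("BigQuery allows between 1 and 4 clustering columns")
    cluster_defs = "".join(f"  {col} STRING,\n" for col in cluster_cols)
    return (
        f"CREATE TABLE `{table}` (\n"
        "  event_ts TIMESTAMP NOT NULL,\n"
        f"{cluster_defs}"
        "  payload STRING\n"
        ")\n"
        "PARTITION BY DATE(event_ts)\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )


if __name__ == "__main__":
    # A yearly suffix on the table name adds one more pruning dimension.
    print(json_events_ddl("my-project.events.raw_2024"))
```

Queries that filter on the partitioning column and on the leading clustering columns should then scan less data than a full read of the two-column table.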
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events to either Dataflow or Pub/Sub, process them there, and write them to BigQuery under a new schema. This job would be able to create tables on the fly with the schema you code in your engine.
By the way, you can remove columns: that is rematerialization, where you rewrite the same table with a query that selects only the columns you want to keep. You can rematerialize to remove duplicate rows as well.
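A minimal sketch of such a rematerialization query, again built as a string in Python. It relies on `CREATE OR REPLACE TABLE ... AS SELECT * EXCEPT (...)`; the table and column names are hypothetical. Two caveats to verify for your own tables: `SELECT * EXCEPT` only drops top-level columns, and for a partitioned or clustered table the partitioning and clustering clauses likely need to be repeated in the statement.

```python
def drop_columns_query(table: str, deprecated: list) -> str:
    """Build a query that rewrites `table` in place without the deprecated columns."""
    if not deprecated:
        raise ValueError("list at least one column to drop")
    cols = ", ".join(deprecated)
    return (
        f"CREATE OR REPLACE TABLE `{table}` AS\n"
        f"SELECT * EXCEPT ({cols})\n"
        f"FROM `{table}`"
    )


if __name__ == "__main__":
    print(drop_columns_query("my-project.events.raw", ["old_form_a", "old_form_b"]))
```

Note that the rewrite scans the whole table, so it is billed like any other full-table query.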
Upvotes: 5