Gunnar Eketrapp

Reputation: 2149

Dealing with evolving schemas

We are a gaming company that stores events (up to 1 billion events per day) in BigQuery. Events are sharded by month and application in order to lower query costs.

Now to our problem.

Our current solution supports adding new types of events, which leads to new versions of the table schema. These versions are also encoded in the table names.

I.e. events_app1_v2_201308

If we add events with new column types in September, we will also get events_app1_v3_201309.

We have written code that determines the tables involved for a given date range and unions them via BigQuery's comma-separated FROM clause.
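For illustration, a minimal sketch of how such a query could be assembled (the dataset name and selected columns are assumptions, not our actual code):

    # Hypothetical sketch of the shard-union approach described above.
    # Table names follow the question's naming scheme; "mydataset" is made up.

    def shard_tables(app, version, months):
        """Fully qualified shard names for a list of YYYYMM strings."""
        return ['[mydataset.events_%s_v%d_%s]' % (app, version, m) for m in months]

    def union_query(tables):
        """In BigQuery legacy SQL, commas between tables in FROM mean UNION ALL, not JOIN."""
        return 'SELECT timestamp, eventId FROM %s' % ', '.join(tables)

    print(union_query(shard_tables('app1', 2, ['201307', '201308'])))
    # SELECT timestamp, eventId FROM [mydataset.events_app1_v2_201307], [mydataset.events_app1_v2_201308]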

But I just realised that this will NOT work when we make unions over different versions of the event tables.

Does anyone have a smart solution for dealing with this?

Right now we are investigating whether JSON structures could help us. The current solution is just flat columns: [timestamp, eventId, value, value, value, ...]
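As a rough sketch of what we mean by that (the column names are just for illustration): keep the fixed columns and serialize the version-specific values into a single string column, so every table version shares one flat schema:

    import json

    def to_row(timestamp, event_id, payload):
        """Pack version-specific event values into one JSON string column,
        so old and new table versions can share the same flat schema."""
        return {'timestamp': timestamp, 'eventId': event_id, 'payload': json.dumps(payload)}

    row = to_row(1377000000, 'level_up', {'level': 12, 'gold': 340})
    # {'timestamp': 1377000000, 'eventId': 'level_up', 'payload': '{"level": 12, "gold": 340}'}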

From https://developers.google.com/bigquery/query-reference#from

Note: Unlike many other SQL-based systems, BigQuery uses the comma syntax to indicate table unions, not joins. This means you can run a query over several tables with compatible(!?) schemas as follows:

Upvotes: 6

Views: 1823

Answers (1)

Jordan Tigani

Reputation: 26627

You should be able to modify the schema of the old tables to add columns; then the union should match. Note that you can only add columns, not remove them. You can use the tables.patch() method to do this, or bq update --schema.

Moreover, as long as the new fields aren't marked REQUIRED, they should be considered compatible. If this is not the case, however, it would be a bug -- let us know if that is what you're experiencing.
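For example, a minimal sketch of adding a NULLABLE column to an old shard, using the google-cloud-bigquery Python client (a newer client than the methods named above, but it issues the same tables.patch call under the hood; all names are illustrative):

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes default credentials are configured

    # Fetch the old shard's current schema (table name follows the question's scheme).
    table = client.get_table('myproject.mydataset.events_app1_v2_201308')

    # Append the new column; it must be NULLABLE (or REPEATED), not REQUIRED,
    # and existing columns cannot be removed.
    new_schema = list(table.schema)
    new_schema.append(bigquery.SchemaField('new_value', 'STRING', mode='NULLABLE'))
    table.schema = new_schema

    # update_table with a field mask of ['schema'] patches only the schema.
    client.update_table(table, ['schema'])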

Upvotes: 6
