Febian Shah

Reputation: 1007

Flexible schema with Google Bigquery

I have around 1000 files that have seven columns. Some of these files have a few rows that have an eighth column (if there is data).

What is the best way to load this into BigQuery? Do I have to find and edit all of these files to either add an empty eighth column to every file or remove the eighth column from every file? I don't care about the value in this column.

Is there a way to specify eight columns in the schema and have a null value inserted for the eighth column when no data is available?

I am using BigQuery APIs to load data if that might help.

Upvotes: 4

Views: 1235

Answers (4)

Febian Shah

Reputation: 1007

Found a workable "hack". I ran a job for each file with the seven-column schema and then ran another job on all files with the eight-column schema. One of the two jobs would complete successfully, saving me the time of editing and re-uploading 1000+ files individually.

Upvotes: 0

Jordan Tigani

Reputation: 26617

You can use the 'allowJaggedRows' argument, which treats missing values at the end of a row as nulls. So your schema can have eight columns, and the eighth column will simply be null for every row that doesn't have a value.

This is documented here: https://developers.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowJaggedRows

I've filed a doc bug to make this easier to find.
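For example, the load configuration sent to the v2 jobs API would look like the sketch below. The bucket, project, dataset, table, and column names are placeholders; only the `allowJaggedRows` key and the eight-field schema are the point.

```python
import json

# Hypothetical project/bucket/table names for illustration.
job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/logs/*.csv"],
            "destinationTable": {
                "projectId": "my-project",
                "datasetId": "my_dataset",
                "tableId": "logs",
            },
            "schema": {
                # Seven columns present in every file, plus a
                # NULLABLE eighth that only some rows populate.
                "fields": [
                    {"name": f"col{i}", "type": "STRING"}
                    for i in range(1, 8)
                ] + [{"name": "col8", "type": "STRING",
                      "mode": "NULLABLE"}]
            },
            # Treat missing trailing values as NULLs instead of
            # rejecting the row.
            "allowJaggedRows": True,
        }
    }
}

print(json.dumps(job, indent=2))
```

With this configuration, seven-value rows load with `col8` set to NULL, and eight-value rows load normally, so no files need to be edited.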

Upvotes: 4

Javier Ramirez

Reputation: 4008

There is a possible solution here if you don't want to change the CSV values (which would otherwise be my recommendation).

If the number of rows with an eighth column is fairly small and you can afford to "sacrifice" those rows, you can pass a maxBadRecords parameter with a reasonable number. In that case, all the "bad" rows (i.e. the ones not conforming to the schema) would be ignored and not loaded.

If you are using BigQuery for statistical information and can afford to ignore those rows, this could solve your problem.
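Concretely, the variant load configuration would keep the seven-column schema and just tolerate a bounded number of rejected rows. The names and the threshold of 500 below are placeholders, not values from the question.

```python
import json

# Hypothetical bucket/table names; only "maxBadRecords" matters here.
load_config = {
    "sourceUris": ["gs://my-bucket/logs/*.csv"],
    "destinationTable": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "logs",
    },
    "schema": {
        "fields": [
            {"name": f"col{i}", "type": "STRING"} for i in range(1, 8)
        ]
    },
    # Allow up to 500 rows that don't conform to the seven-column
    # schema (e.g. the eight-value rows) to be skipped rather than
    # failing the whole job.
    "maxBadRecords": 500,
}

print(json.dumps(load_config, indent=2))
```

The trade-off versus allowJaggedRows is that here the eight-value rows are dropped entirely instead of loaded with a NULL.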

Upvotes: 0

N.N.

Reputation: 3172

If your logs are in JSON, you can define a nullable field, and if it does not appear in a record, it stays null. I am not sure how this works with CSV, but I think you have to include all fields (even empty ones).
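As a sketch, newline-delimited JSON can simply omit the optional field per record, and a NULLABLE column absorbs the absence. The field names below are invented for illustration.

```python
import json

# Two newline-delimited JSON records; the second omits the optional
# eighth field entirely, which a NULLABLE BigQuery column accepts.
ndjson = "\n".join([
    json.dumps({"col1": "a", "col8": "extra"}),
    json.dumps({"col1": "b"}),
])

for line in ndjson.splitlines():
    record = json.loads(line)
    # Absent key -> None, mirroring how the loaded column stays NULL.
    print(record.get("col8"))
```

This is why JSON sidesteps the jagged-row problem that CSV has: absence of a key is well-defined, whereas a short CSV row is ambiguous without allowJaggedRows.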

Upvotes: 1
