R.A

Reputation: 111

Is there a way to provide schema or auto-detect schema when uploading csv from GCS to BigQuery?

I am trying to upload a csv file from Google Cloud Storage (GCS) to BigQuery (BQ) and auto-detect schema.

What I tried: I enabled schema auto-detection and entered the number of rows to skip in the "Header rows to skip" option. My file has 6 rows of descriptive information about the data which I need to skip. The 7th row is my actual header row.

According to Google's documentation in: https://cloud.google.com/bigquery/docs/schema-detect#auto-detect:

"The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field."

The problem with my CSV is that the above condition is not met, since my rows contain null values.

Also, many rows in my CSV contain no numerical values at all, which I think adds extra complexity for Google's schema auto-detection.

Auto-detection is not producing the correct column names or field types. All field types are detected as strings, and column names are assigned as string_field_0, string_field_1, and so on. It is also treating my CSV's header row as a row of data.

I would like to know how I can correctly upload this CSV to BQ, skipping the unwanted leading rows and getting the correct schema (field names and field types).

Upvotes: 1

Views: 2276

Answers (3)

After reading some of the documentation, specifically the CSV header section, I think what you're observing is the expected behavior.

An alternative would be to manually specify the schema for the data.
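With a manual schema you also skip the header row itself, since BigQuery no longer needs it. A minimal sketch with the bq CLI (dataset, table, bucket, and column names here are hypothetical; substitute your own):

```shell
# Skip all 7 leading rows (6 descriptive rows + the header row),
# and supply the schema explicitly instead of using --autodetect.
bq load \
  --source_format=CSV \
  --skip_leading_rows=7 \
  mydataset.mytable \
  gs://mybucket/mydata.csv \
  "name:STRING,age:INTEGER,score:FLOAT"
```

For more complex schemas you can pass a JSON schema file as the last argument instead of the inline column list.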

Upvotes: 2

R.A

Reputation: 111

Solved this by counting my actual header row in the number of rows to skip.

I had 6 rows I actually needed to skip. The 7th row was my header (column names). I was entering 6 in "Header rows to skip".

When I entered 7 instead of 6, the schema was auto detected correctly.
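This matches BigQuery's documented CSV header behavior: when auto-detection is on and you specify N rows to skip, BigQuery skips the first N-1 rows and tries to detect the header in row N. The equivalent with the bq CLI would look something like this (names are hypothetical):

```shell
# With --autodetect and --skip_leading_rows=7, BigQuery skips the
# 6 descriptive rows and detects the header in row 7.
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=7 \
  mydataset.mytable \
  gs://mybucket/mydata.csv
```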

Also, regarding this sentence in Google's documentation: "The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field." It turns out nulls are considered values, so they were not actually causing a problem in the upload to BQ.

Hope this helps someone facing the same issue!

Upvotes: 0

Ugo

Reputation: 159

You can try a tool like bigquery-schema-generator to generate the schema from your CSV file and then use it in a bq load job, for example.
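A rough sketch of that workflow (file, dataset, and table names are hypothetical, and this assumes `pip install bigquery-schema-generator`). Note that the tool expects the CSV header on the first line, so the 6 descriptive rows would need to be stripped first:

```shell
# Strip the 6 descriptive rows so the header is on line 1.
tail -n +7 raw.csv > mydata.csv

# Generate a BigQuery JSON schema from the CSV.
generate-schema --input_format csv < mydata.csv > mydata.schema.json

# Load the cleaned file, skipping only the header row,
# with the generated schema instead of --autodetect.
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.mytable \
  mydata.csv \
  ./mydata.schema.json
```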

Upvotes: 1
