Mark Sanders
Mark Sanders

Reputation: 73

Missing row with Google BigQuery when using permanent external table pointing to GCS file

I'm working with BigQuery and experementing with using it to query a CSV file in a bucket in Google Cloud Storage. I came across some strange behavior where -- only when there are 3 rows in the CSV, and the first row contains a string in any field -- when querying the table, the first row is missing.

Steps to reproduce:

Create myfile.csv:

testin,2,2
testing3,3,4

Copy file to GCS:

gsutil cp myfile.csv gs://bucket/

Create external table pointing to gs://bucket/myfile.csv:

bq mk --external_table_definition=Field1:STRING,Field2:STRING,Field3:INTEGER@CSV=gs://bucket/myfile.csv dataset.table

Query the table to ensure it's working:

bq query "SELECT * FROM dataset.table;"

Should output the following:

Waiting on biquery_job_id_1234567 ... (0s) Current 
status: DONE
+----------+--------+--------+
|  Field1  | Field2 | Field3 |
+----------+--------+--------+
| testin   | 2      |      2 |
| testing3 | 3      |      4 |
+----------+--------+--------+

All good to move on to the bug..

Change myfile.csv to look like the following:

1,h,3
testin,2,2
testing3,3,4

Overwrite gs://bucket/myfile.csv:

gsutil cp myfile.csv gs://bucket/

Query dataset.table again:

bq query "SELECT * FROM dataset.table;"

Outputs:

Waiting on bigquery_job_78901234 ... (0s) Current status: DONE
+----------+--------+--------+
|  Field1  | Field2 | Field3 |
+----------+--------+--------+
| testin   | 2      |      2 |
| testing3 | 3      |      4 |
+----------+--------+--------+

Lets make sure that the GCS file has the correct data:

gsutil cat gs://bucket/myfile.csv

Outputs:

1,h,3
testin,2,2
testing3,3,4

Ok, interesting.

Change that middle value in the top row so that myfile.csv looks as follows:

1,2,3
testin,2,2
testing3,3,4

Overwrite the GCS file:

gsutil cp myfile.csv gs://bucket/

Query the table:

bq query "SELECT * FROM dataset.table;"

Outputs:

Waiting on bigquery_job_4567890 ... (0s) Current status: DONE
+----------+--------+--------+
|  Field1  | Field2 | Field3 |
+----------+--------+--------+
| 1        | 2      |      3 |
| testin   | 2      |      2 |
| testing3 | 3      |      4 |
+----------+--------+--------+

Does anyone have any insight as to what scenarios may cause the first row to become missing if it contains a string within the first 2 fields?

Thanks,

Upvotes: 3

Views: 1015

Answers (1)

Tim Swena
Tim Swena

Reputation: 14786

There is a parameter called csvOptions.skipLeadingRows which is used to specify the number of "header rows" in a CSV file.

If skipLeadingRows is unspecified, BigQuery tries to autodetect the number of header rows. Setting skipLeadingRows manually to 0 should disable this behavior.

Upvotes: 1

Related Questions