EddWebster

Reputation: 1

Difficulties creating CSV table in Google BigQuery

I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.

The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.

The error I am getting from the job history in the BigQuery API is:

Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 2949; errors: 1. Please look into the errors[] collection for more details.

When I am uploading the CSV files, I am selecting the following:

  • file format: CSV
  • table type: native table
  • auto detect: tried both automatic and manual
  • partitioning: no partitioning
  • write preference: WRITE_EMPTY (cannot change this)
  • number of errors allowed: 0
  • ignore unknown values: unchecked
  • field delimiter: comma
  • header rows to skip: 1 (also tried 0, and manually deleting the header rows from the CSV files)
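
For reference, the same load expressed with the Python BigQuery client would look roughly like this (a sketch only; the project, dataset, table, and bucket names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names -- substitute the real dataset/table and bucket.
table_id = "my_project.my_dataset.my_table"
uri = "gs://my_bucket/monthly_export.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,              # tried both automatic and manual schemas
    skip_leading_rows=1,          # header rows to skip
    field_delimiter=",",
    max_bad_records=0,            # number of errors allowed
    ignore_unknown_values=False,
    write_disposition=bigquery.WriteDisposition.WRITE_EMPTY,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # raises the same "too many errors" message on failure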

Any help would be greatly appreciated.

Upvotes: 0

Views: 3968

Answers (4)

saifuddin778

Reputation: 7277

This usually points to an error in the structure of the data source (in this case, your CSV file). Since your CSV file is small, you can run a quick validation script to check that the number of columns is exactly the same across all rows before running the export.

Maybe something like:

awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }' myfile.csv

Or you can gate the export on that condition (say your number of columns should be 5):

ncols=$(awk -F, '{ a[NF]++ } END { for (n in a) { print n; exit } }' myfile.csv)
if [ "$ncols" -eq 5 ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
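
Note that awk -F, splits on every comma, including commas inside quoted fields. A quoting-aware version of the same check, sketched with Python's csv module (the filename is assumed):

import csv
from collections import Counter

# Count how many rows have each column width, respecting quoted commas.
widths = Counter()
with open("myfile.csv", newline="") as f:
    for row in csv.reader(f):
        widths[len(row)] += 1

for ncols, nrows in sorted(widths.items()):
    print(f"{nrows} rows have {ncols} columns")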

Upvotes: 1

Ary Jazz

Reputation: 1656

I'm probably too late for this, but it seems the file has some errors (it could be a character that cannot be parsed, or simply a string in an int column) and BigQuery cannot load it automatically.

You need to understand what the error is and fix it. An easy way to do that is by running this command in the terminal:

bq --format=prettyjson show -j <JobID>

and you will be able to see additional logs for the error to help you understand the problem.

If the error happens only a few times, you can just increase the number of errors allowed (the max_bad_records setting). If it happens many times, you will need to fix up your CSV file before you upload it.

Hope it helps

Upvotes: 0

Armin_SC

Reputation: 2260

As mentioned by Scicrazed, this issue usually means that some file rows have an incorrect format, in which case you need to validate the file's contents in order to figure out the specific error that is causing the failure.

I recommend checking the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus in the Stackdriver logs, which contains the same complete error data that is reported by the service.
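
A minimal sketch of pulling that errors[] collection with the Python client (the job ID and location are placeholders; copy the real ones from the job history):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder job ID and location -- taken from the failed load job.
job = client.get_job("bquxjob_xxxxxxxx", location="US")

# For a failed load, job.errors holds the full errors[] collection.
for error in job.errors or []:
    print(error)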

Upvotes: 0

Scicrazed

Reputation: 602

It's impossible to pinpoint the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted, and a single typo can confuse BQ into reporting thousands of errors. Say you have the following CSV file:

Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48

With the following schema:

Name(String)    Class(Int64)    Dorm(String)    Room(String)    GPA(Float64)

Since the second row is missing a comma, everything in it is shifted one column over. In a large file, that produces thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.

I suggest you run your CSV file through a CSV validator before uploading it to BQ; it might find something that breaks the load. It's even possible that one of your fields has a comma inside the value, which breaks everything.

Another theory to investigate is whether all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
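
A schema-aware check along those lines could be sketched as follows (the filename is assumed, and the column order/types match the example schema above):

import csv

# Expected types for the example schema: Name, Class, Dorm, Room, GPA.
casts = [str, int, str, str, float]

with open("students.csv", newline="") as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != len(casts):
            print(f"line {lineno}: expected {len(casts)} columns, got {len(row)}")
            continue
        for value, cast in zip(row, casts):
            if value == "":
                print(f"line {lineno}: empty value in a required column")
                continue
            try:
                cast(value)
            except ValueError:
                print(f"line {lineno}: {value!r} failed to parse as {cast.__name__}")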

Upvotes: 0
