Edward Stokes

Reputation: 11

Pandas to BigQuery upload fails due to InvalidSchema error

I'm using Pandas' to_gbq to append a dataframe to a BigQuery table, as I have done successfully in the past, using the code below (I only explicitly declared one field in the schema so it would be recognized as a date; otherwise it was forced to a string):

schema = [{'name': 'Week', 'type': 'DATE'}]

def load_to_BQ():
    dataframe.to_gbq(destination_table='Table.my_table',
                     project_id='myprojectid',
                     table_schema=schema,
                     if_exists='append')

When running this, I get the following error:

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

I'm confused because I have uploaded and appended dataframes to the same BQ table before using this code. I checked the schema against the dataframe columns and they all match and are in the right order. I suspect the culprit is a date field called "Week" in the dataframe, but even in BQ the "Week" field is listed as DATE. I've cast the field to datetime using:

dataframe['Week'] = pd.to_datetime(dataframe['Week'], format='%m-%d-%y').dt.date
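
For what it's worth, here is a minimal reproduction of what that cast does to the dtype (the sample dates are made up, but the column name matches mine):

import pandas as pd

# Hypothetical sample data mirroring the "Week" column
df = pd.DataFrame({'Week': ['01-03-22', '01-10-22']})

df['Week'] = pd.to_datetime(df['Week'], format='%m-%d-%y')
print(df['Week'].dtype)   # datetime64[ns]

# .dt.date turns each value into a Python datetime.date,
# so the column dtype becomes generic "object"
df['Week'] = df['Week'].dt.date
print(df['Week'].dtype)   # object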

When I check the schema type with schema.generate_bq_schema(dataframe), the "Week" field comes back as TIMESTAMP. I've seen suggestions to use "TIMESTAMP" for BQ instead of "DATE", but when I changed that in the schema, I got the same error. Can anyone point out what I'm doing wrong? This is the full error message:

InvalidSchema                             Traceback (most recent call last)
<ipython-input-117-fb947996ea53> in <module>
     30     answer = input("Are you sure you want to load to BigQuery? (y/n)")
     31     if answer == "y":
---> 32         load_to_BQ()
     33     else:
     34         print("Load failed.")

<ipython-input-117-fb947996ea53> in load_to_BQ()
     12 #         dataframe, table_id, job_config=job_config
     13 #     )
---> 14     dataframe.to_gbq(destination_table='table.my_table',
     15                     project_id='myprojectid',
     16                     table_schema=schema,

~\anaconda3\lib\site-packages\pandas\core\frame.py in to_gbq(self, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
   1708         from pandas.io import gbq
   1709 
-> 1710         gbq.to_gbq(
   1711             self,
   1712             destination_table,

~\anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
    209 ) -> None:
    210     pandas_gbq = _try_import()
--> 211     pandas_gbq.to_gbq(
    212         dataframe,
    213         destination_table,

~\anaconda3\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, verbose, private_key)
   1074                 original_schema, table_schema
   1075             ):
-> 1076                 raise InvalidSchema(
   1077                     "Please verify that the structure and "
   1078                     "data types in the DataFrame match the "

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Upvotes: 1

Views: 1611

Answers (1)

Steven Ensslen

Reputation: 1376

Pandas has different data types than BigQuery. Specifically, while BigQuery supports DATE, DATETIME, TIME, and TIMESTAMP, pandas only supports NumPy's datetime64. Sometimes pandas can represent your data as datetime64 and sometimes it can't; for example, datetime64[ns] can't store 9999-12-31, because that date overflows its nanosecond-resolution range. Worse, if a column in an individual CSV contains only nulls, pandas will read it in as a float column.
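
A quick illustration of that bound (a minimal sketch; nothing here is specific to your data):

import pandas as pd

# datetime64[ns] tops out at 2262-04-11, so a far-future DATE
# that BigQuery happily stores cannot round-trip through pandas
print(pd.Timestamp.max)   # 2262-04-11 23:47:16.854775807

try:
    pd.to_datetime('9999-12-31')
except pd.errors.OutOfBoundsDatetime as exc:
    print(exc)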

Pandas is not an ETL tool. It duck-types like one much of the time, but it is not a general-purpose solution for the kind of problem you are attempting to solve.

If you want to use pandas you'll need to:

  1. Create the final table without pandas, using a tool that lets you control the data types (i.e. a CREATE TABLE DDL statement).
  2. Use the code you have to insert into a temporary table (either a table in an expiring dataset, or set an expiration on the table when you create it).
  3. Use something other than pandas to copy the data from the temp table to the final table; a BigQuery INSERT statement works (see the sketch after this list).
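
Here is a rough sketch of those three steps with the google-cloud-bigquery client. All project, dataset, table, and column names below are placeholders (adjust the DDL to your real columns), and `dataframe` is the same frame as in your question:

from google.cloud import bigquery

client = bigquery.Client(project='myprojectid')  # placeholder project id

# 1. Create the final table with explicit column types via DDL
client.query("""
    CREATE TABLE IF NOT EXISTS mydataset.final_table (
        Week DATE,
        value FLOAT64          -- stand-in for your real columns
    )
""").result()

# 2. Let pandas create/replace a staging table with whatever types
#    it infers (e.g. TIMESTAMP for Week), then give it an expiry
dataframe.to_gbq(destination_table='mydataset.staging_table',
                 project_id='myprojectid',
                 if_exists='replace')
client.query("""
    ALTER TABLE mydataset.staging_table
    SET OPTIONS (expiration_timestamp =
                 TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
""").result()

# 3. Copy into the final table, casting to the declared types in SQL
client.query("""
    INSERT INTO mydataset.final_table (Week, value)
    SELECT CAST(Week AS DATE), value
    FROM mydataset.staging_table
""").result()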

Upvotes: 1
