Reputation: 11
I'm using pandas' to_gbq to append a dataframe to a BigQuery table, as I have done successfully in the past with this code (I only explicitly declared one field in the schema so it would be recognized as a date; otherwise it was forced to a string):
schema = [{'name': 'Week', 'type': 'DATE'}]
def load_to_BQ():
    dataframe.to_gbq(destination_table='Table.my_table',
                     project_id='myprojectid',
                     table_schema=schema,
                     if_exists='append')
When running this, I get the following error:
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
I'm confused because I have uploaded and appended dataframes to the same BQ table before using this code. I checked the schema against the dataframe columns and they all match and are in the right order. I suspect the culprit is a date field called "Week" in the dataframe, but even in BQ the "Week" field is listed as DATE. I've cast the field to datetime using:
dataframe['Week'] = pd.to_datetime(dataframe['Week'], format='%m-%d-%y').dt.date
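Note that after this cast the column dtype is object rather than datetime64[ns], because .dt.date returns python date objects, which can be confirmed directly:

# After .dt.date, the Series holds datetime.date objects, so its
# dtype is the generic 'object', not datetime64[ns]:
print(dataframe['Week'].dtype)          # object
print(type(dataframe['Week'].iloc[0]))  # <class 'datetime.date'>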
When I check the schema type with schema.generate_bq_schema(dataframe), the "Week" field comes back as TIMESTAMP. I've seen suggestions saying to use "TIMESTAMP" for BQ instead of "DATE", but when I changed that in the schema I got the same error. Can anyone point out what I'm doing wrong? This is the full error message:
InvalidSchema Traceback (most recent call last)
<ipython-input-117-fb947996ea53> in <module>
30 answer = input("Are you sure you want to load to BigQuery? (y/n)")
31 if answer == "y":
---> 32 load_to_BQ()
33 else:
34 print("Load failed.")
<ipython-input-117-fb947996ea53> in load_to_BQ()
12 # dataframe, table_id, job_config=job_config
13 # )
---> 14 dataframe.to_gbq(destination_table='table.my_table',
15 project_id='myprojectid',
16 table_schema=schema,
~\anaconda3\lib\site-packages\pandas\core\frame.py in to_gbq(self, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
1708 from pandas.io import gbq
1709
-> 1710 gbq.to_gbq(
1711 self,
1712 destination_table,
~\anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
209 ) -> None:
210 pandas_gbq = _try_import()
--> 211 pandas_gbq.to_gbq(
212 dataframe,
213 destination_table,
~\anaconda3\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, verbose, private_key)
1074 original_schema, table_schema
1075 ):
-> 1076 raise InvalidSchema(
1077 "Please verify that the structure and "
1078 "data types in the DataFrame match the "
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Upvotes: 1
Views: 1611
Reputation: 1376
Pandas has different data types than BigQuery. Specifically, while BigQuery supports DATE, DATETIME, TIME, and TIMESTAMP, pandas only supports numpy's datetime64. Sometimes pandas will represent your data as datetime64 and sometimes it won't. For example, datetime64[ns] can't store 9999-12-31 (it tops out in April 2262). Worse, if a column in an individual CSV is entirely null, pandas will parse it as a float column.
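Both pitfalls are easy to demonstrate (a minimal illustration; the toy CSV here is made up):

import io
import pandas as pd

# datetime64[ns] has a hard upper bound well short of 9999-12-31:
print(pd.Timestamp.max)   # 2262-04-11 23:47:16.854775807

# A CSV column that is entirely null is parsed as float64:
df = pd.read_csv(io.StringIO('a,b\n1,\n2,\n'))
print(df.dtypes)          # a: int64, b: float64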
Pandas is not an ETL tool. It duck-types like one much of the time, but it is not a general purpose solution for the type of problem that you are attempting to solve.
If you want to use pandas you'll need to:
- Create the destination table in BigQuery yourself first (e.g., with a CREATE TABLE DDL statement).
- Cast your DataFrame columns so that the equivalent INSERT statement works (see the sketch below).
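A minimal sketch of both steps, reusing the table and column names from the question (the DDL body and the choice to keep 'Week' as datetime64[ns] rather than python date objects are assumptions, not code from the question):

import pandas as pd
import pandas_gbq
from google.cloud import bigquery

# Step 1: create the destination table yourself with explicit types
# (any columns besides 'Week' are hypothetical).
client = bigquery.Client(project='myprojectid')
client.query('''
    CREATE TABLE IF NOT EXISTS Table.my_table (
        Week DATE
        -- ...remaining columns, typed to match the DataFrame...
    )
''').result()

# Step 2: cast the DataFrame so the implied INSERT lines up with those
# types; leaving 'Week' as datetime64[ns] lets pandas-gbq map it to DATE.
dataframe['Week'] = pd.to_datetime(dataframe['Week'], format='%m-%d-%y')
pandas_gbq.to_gbq(
    dataframe,
    destination_table='Table.my_table',
    project_id='myprojectid',
    table_schema=[{'name': 'Week', 'type': 'DATE'}],
    if_exists='append',
)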
Upvotes: 1