Reputation: 81
pandas.to_gbq() has recently started returning an error when I attempt to append a dataframe to a BigQuery table, despite the df schema/data types being exactly the same as those of the BigQuery table.
Code snippet below:
df.to_gbq(destination_table=PROCESSED_DATA_TABLE_NAME,
          project_id=PROJECT_NAME,
          if_exists='append')
Returns:
File ~\Documents\DartsModel\update_processed_visit_data\main_dev.py:152 in <module>
df.to_gbq(destination_table = PROCESSED_DATA_TABLE_NAME,
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas\core\frame.py:2054 in to_gbq
gbq.to_gbq(
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas\io\gbq.py:212 in to_gbq
pandas_gbq.to_gbq(
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas_gbq\gbq.py:1198 in to_gbq
connector.load_data(
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas_gbq\gbq.py:591 in load_data
chunks = load.load_chunks(
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas_gbq\load.py:238 in load_chunks
load_parquet(
File ~\Anaconda3\envs\darts_model\lib\site-packages\pandas_gbq\load.py:130 in load_parquet
client.load_table_from_dataframe(
File ~\Anaconda3\envs\darts_model\lib\site-packages\google\cloud\bigquery\client.py:2628 in load_table_from_dataframe
_pandas_helpers.dataframe_to_parquet(
File ~\Anaconda3\envs\darts_model\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py:672 in dataframe_to_parquet
arrow_table = dataframe_to_arrow(dataframe, bq_schema)
File ~\Anaconda3\envs\darts_model\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py:617 in dataframe_to_arrow
bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
File ~\Anaconda3\envs\darts_model\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py:342 in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
File pyarrow\array.pxi:1033 in pyarrow.lib.Array.from_pandas
File pyarrow\array.pxi:312 in pyarrow.lib.array
File pyarrow\array.pxi:83 in pyarrow.lib._ndarray_to_array
File pyarrow\error.pxi:123 in pyarrow.lib.check_status
ArrowTypeError: Expected bytes, got a 'datetime.date' object
Relevant package versions below:
python==3.9.12
pandas==1.4.2
pandas-gbq==0.17.6
arrow==1.2.2
google-cloud-bigquery==3.2.0
google-cloud-bigquery-storage==2.13.2
Can't seem to find any solutions online so any help would be greatly appreciated! Thanks.
Upvotes: 8
Views: 15223
Reputation: 534
I have seen a similar issue that I can't immediately reproduce, but I can describe it. I have pandas DataFrames with about 500,000 64-bit integers that I convert to strings and write to Parquet files using PyArrow. This had always worked fine.
Recently, however, a new record was created with an ID that began with the four digits 1001, and on that day the previously working pipeline broke. Either it was a coincidence, or PyArrow saw that 1001, convinced itself it was receiving bytes, and threw ArrowTypeError ("Expected bytes, got a 'int' object").
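If type inference really is the culprit, a workaround is to cast the column to a proper string dtype and hand PyArrow an explicit schema so it never has to guess. This is a minimal sketch rather than my original pipeline, and the column name "id" is made up:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data: 64-bit integer IDs that should be stored as strings.
df = pd.DataFrame({"id": [1001234567890123456, 1001987654321098765]})
df["id"] = df["id"].astype(str)  # cast explicitly instead of relying on inference

# Explicit schema: PyArrow will not try to infer the column type itself.
schema = pa.schema([("id", pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, "ids.parquet")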
Upvotes: 0
Reputation: 532
An alternative to the to_gbq() method is to use Google Cloud's google-cloud-bigquery package directly.
As long as the schema of the BigQuery table and the local df match, appending to the BigQuery table can be accomplished with the following code:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

# define project, dataset, and table_name variables
project, dataset, table_name = "project", "dataset", "table_name"
table_id = f"{project}.{dataset}.{table_name}"

job_config = bigquery.job.LoadJobConfig()
# set write_disposition parameter as WRITE_APPEND for appending to table
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    f"Loaded {table.num_rows} rows and {len(table.schema)} columns to {table_id}"
)
The end result is the same: the dataframe is appended to the BigQuery table.
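If the ArrowTypeError from the question still shows up with this route, a hedged variant is to give the load job an explicit schema so the datetime.date column is mapped to a DATE field instead of being mis-inferred. The column name "visit_date" here is hypothetical; adjust it to your table:
job_config = bigquery.job.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Explicit type for the date column; the remaining columns are
    # detected from the dataframe dtypes as before.
    schema=[bigquery.SchemaField("visit_date", "DATE")],
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()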
Upvotes: 4
Reputation: 1115
One of your columns probably has dtype 'object', but the underlying values are actually datetime.date objects. You can convert the column as follows to make to_gbq() work. Note that this relies on the destination column being a TIMESTAMP.
object_cols = df.select_dtypes(include=['object']).columns
for column in object_cols:
    dtype = str(type(df[column].values[0]))
    if dtype == "<class 'datetime.date'>":
        df[column] = pd.to_datetime(df[column], infer_datetime_format=True)

df.to_gbq(destination_table=PROCESSED_DATA_TABLE_NAME, if_exists='replace')
Note, this will replace the table. If you want to append, you need to make sure the types match, for example by passing the table_schema parameter to to_gbq(), as sketched below.
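A hedged sketch of such an append, assuming the converted column should land in a TIMESTAMP field (the column name "visit_date" is hypothetical):
df.to_gbq(
    destination_table=PROCESSED_DATA_TABLE_NAME,
    project_id=PROJECT_NAME,
    if_exists='append',
    # Declare the BigQuery type for the converted column so pandas-gbq
    # builds the intermediate parquet file with a matching Arrow type;
    # columns not listed here are inferred from the dataframe dtypes.
    table_schema=[{'name': 'visit_date', 'type': 'TIMESTAMP'}],
)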
Upvotes: 4