Simon Breton
Simon Breton

Reputation: 2876

"pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data" when sending data to BigQuery without schema

I'm working on a script where I'm sending a dataframe to BigQuery:

load_job = bq_client.load_table_from_dataframe(
    df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE])
)

# Wait for the load job to complete
return load_job.result() 

This is working fine but only if a schema has already been defined in BigQuery or if I'm defining the schema of my job in my script. If no schema has been defined I have the following error:

Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1661, in load_table_from_dataframe dataframe.to_parquet(tmppath, compression=parquet_compression) File "/env/local/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet **kwargs File "/env/local/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write **kwargs File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 1270, in write_table writer.write_table(table, row_group_size=row_group_size) File "/env/local/lib/python3.7/site-packages/pyarrow/parquet.py", line 426, in write_table self.writer.write_table(table, row_group_size=row_group_size) File "pyarrow/_parquet.pyx", line 1311, in pyarrow._parquet.ParquetWriter.write_table File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1578661876547574000 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function _function_handler.invoke_user_function(event_object) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function return call_user_function(request_or_event) File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function event_context.Context(**request_or_event.context)) File "/user_code/main.py", line 151, in main df = df(param1, param2) File "/user_code/main.py", line 141, in get_df df, '.'.join([PROJECT, DATASET, PROGRAMS_TABLE]) File "/env/local/lib/python3.7/site-packages/google/cloud/bigquery/client.py", line 1677, in load_table_from_dataframe os.remove(tmppath) FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp_ps5xji9_job_634ff274.parquet'

Why is pyarrow generating this error? How can I solve it other than pre-defining schema?

Upvotes: 23

Views: 39803

Answers (7)

russhoppa
russhoppa

Reputation: 403

If you don't care about the float values of the seconds you can just remove them and pyarrow won't have a problem converting the data. It just has a hard time retaining a lot of specificity with the seconds values; e.g.,, 2023-01-10 21:02:46.153846272 --> 2023-01-10 21:02:46

example:

df.date_column = df.date_column.dt.floor('s')

Upvotes: 1

Bushmaster
Bushmaster

Reputation: 4608

I was getting the same error. When i inspect the dataframe i saw a value like this: 2021-09-30 23:59:59.999999998

Probably there is a mismatch in your date fields with the bigquery default. Use this:

df['date_column'] =df['date_column'].astype('datetime64[s]')

Upvotes: 17

Contango
Contango

Reputation: 80372

If using write_dataset, example code to use file_options to eliminate this error:

import pyarrow.dataset as ds
parquet_format = ds.ParquetFileFormat()
file_options = parquet_format.make_write_options(coerce_timestamps='us', allow_truncated_timestamps=True)

ds.write_dataset(..., file_options=file_options)

Added as anyone who queries the PyArrow error in the title will end up here.

Upvotes: 3

BHP
BHP

Reputation: 996

The solution for me was to add the following kwargs to to_parquet:

parquet_args = {
    'coerce_timestamps': 'us',
    'allow_truncated_timestamps': True,
}

You have to set both of them. If you set just allow_truncated_timestamps, it will still raise the error if coerce_timestamps is None. I think the idea is that you only want to suppress the error if you're explicitly asking for coercion. Anyway, the docs are clear about it, but this behavior wasn't obvious to me.

Upvotes: 8

Tim Swena
Tim Swena

Reputation: 14786

In my testing of https://github.com/googleapis/python-bigquery-pandas/pull/413, this issue is fixed by upgrading to pandas 1.1.0+.

Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular would have helped here, but potentially the fix for mixing and matching different timezones. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

Upvotes: 0

Matt Simmons
Matt Simmons

Reputation: 151

I think these errors arise because the pyarrow.parquet module used by the BigQuery library does convert Python's built in datetime or time types into something that BigQuery recognises by default, but the BigQuery library does have its own method for converting pandas types.

I was able to get it to upload timestamps by changing all instances of datetime.datetime or time.time to pandas.Timestamp. For example:

my_df['timestamp'] = datetime.utcnow()

would need to be changed to

my_df['timestamp'] = pd.Timestamp.now()

Upvotes: 1

Wes McKinney
Wes McKinney

Reputation: 105611

The default behavior when converting to Arrow or Parquet from pandas is to not allow silent data loss. There are options to set when doing the conversion to allow permit unsafe casts causing loss of timestamp precision or other forms of data loss. The BigQuery Python API would need to set these options, so it may be a bug in the BigQuery library. I suggest reporting on their issue tracker https://github.com/googleapis/google-cloud-python

Upvotes: 7

Related Questions