Reputation: 363
I have a pandas dataframe named "df" that I'm creating with the result of a SQL query against a Netezza database. I'm working in Jupyter notebook. The dataframe has two rows and two of the columns (CREATEDDATE and STAGEDATE) contain datetime values. When I run print(df), the result looks like this:
ID ISDELETED PARENTID CREATEDBYID \
0 017o000003tQRftAAG false a0no000000Hrv1IAAR 005o0000001w8wgAAA
1 017o000003jl52cAAA false a0no000000GszDUAAZ 005o0000001w2pTAAQ
CREATEDDATE FIELD OLDVALUE NEWVALUE STAGEDATE
0 2015-07-30 14:51:41 created None None 2016-06-06
1 2015-07-16 14:48:37 created None None 2016-06-06
If I run print(df.dtypes), the result is this:
ID object
ISDELETED object
PARENTID object
CREATEDBYID object
CREATEDDATE datetime64[ns]
FIELD object
OLDVALUE object
NEWVALUE object
STAGEDATE datetime64[ns]
dtype: object
So, as far as I can tell, my datetime columns are correctly formatted to write to Snowflake using write_pandas(). However, after I do that, the dates are wildly different in Snowflake:
The STAGEDATE value of 2016-06-06, for example, is now 48399-06-06. Does anyone know how to fix this? I'm doing my pull from Netezza using pyodbc and using df = cs.execute() to populate the dataframe. I am importing and using snowflake.connector for my connection to Snowflake and am running the following to get write_pandas:
from snowflake.connector.pandas_tools import write_pandas
Upvotes: 15
Views: 17327
Reputation: 61
The solution presented by Lorenzo sufficed for local development but failed when containerizing the script. I assumed a missing Python library, but found a better solution offered by Snowflake.
The problem was fixed by upgrading to snowflake-connector-python==3.4.0 and passing the use_logical_type=True parameter to write_pandas, which inserts the pandas datetime columns as timestamps with no time zone (NTZ). This parameter maps to the USE_LOGICAL_TYPE option of the underlying COPY INTO Parquet file format that the Snowflake connector uses.
success, nchunks, nrows, output = write_pandas(
    conn,
    df,
    table_name=df.name,
    auto_create_table=True,
    overwrite=True,
    chunk_size=150000,
    compression='gzip',
    on_error="ABORT_STATEMENT",
    parallel=4,
    quote_identifiers=True,
    use_logical_type=True,
)
Upvotes: 1
Reputation: 71
The fix_date_cols(df, tz='UTC') solution, which fixes datetime columns by localizing them to UTC, did not work for me.
However, write_pandas() now issues a UserWarning telling you to pass the use_logical_type=True parameter if your dataframe has datetime columns. And that worked!
write_pandas(conn=conn, df=df_clean, table_name='VENTITY_STG', use_logical_type=True)
Upvotes: 7
Reputation: 588
I'm hoping this will help others that are in a similar position to mine.
My data frame column had a data type of date and I kept getting this error:
Eventually I found that the solution was to change the date to an ISO 8601 datetime format.
dataDF["DATE_COL"] = dataDF["DATE_COL"].astype('datetime64[ns]')  # the bare 'datetime64' alias is rejected by newer pandas
dataDF["DATE_COL"] = dataDF["DATE_COL"].dt.strftime('%Y-%m-%d')  # Note: the full format can be strftime('%Y-%m-%dT%H:%M:%SZ')
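As a quick self-contained check of the conversion above (the frame and the column name are invented for illustration, not taken from the original data):

```python
import pandas as pd

# Toy frame standing in for dataDF; "DATE_COL" is just an illustrative name.
dataDF = pd.DataFrame({"DATE_COL": ["2016-06-06", "2015-07-30"]})

dataDF["DATE_COL"] = dataDF["DATE_COL"].astype("datetime64[ns]")
dataDF["DATE_COL"] = dataDF["DATE_COL"].dt.strftime("%Y-%m-%d")

print(dataDF["DATE_COL"].tolist())  # ['2016-06-06', '2015-07-30']
```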
Hope this helps others as I spent AGES trying to figure this out!
Upvotes: 9
Reputation: 1
The solution from @Lorenzo Vitali works well, and I've added it to my Snowflake helper class. You'll want to remember to return df. While this is a "partial copy", it fixes the above issue.
def fix_date_cols(df, tz='UTC'):
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)
    return df
Upvotes: 0
Reputation: 340
I found a solution with the code that pday wrote here. This function automatically adds a timezone (UTC by default) to the date-type columns.
def fix_date_cols(df, tz='UTC'):
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)
So I would suggest applying this function before passing the dataframe to write_pandas.
From what I understand, the issue is that the datetime values are misinterpreted because they are not well-defined enough. Adding timezone information forces them to be interpreted unambiguously as datetimes.
But I really recommend reading this amazing conversation, where the issue is explained well; it really helped me.
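A minimal sanity check of fix_date_cols (two invented rows; the point is only the resulting dtype):

```python
import pandas as pd

def fix_date_cols(df, tz='UTC'):
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)
    return df

df = pd.DataFrame({"STAGEDATE": pd.to_datetime(["2016-06-06", "2016-06-06"])})
df = fix_date_cols(df)

# The column is now timezone-aware, so it serializes unambiguously.
print(df["STAGEDATE"].dtype)  # datetime64[ns, UTC]
```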
Upvotes: 11
Reputation: 26043
SELECT '2016-06-06'::timestamp AS a
    ,to_timestamp_ntz(date_part('epoch_second', a), 0)::date AS a_s
    ,to_timestamp_ntz(date_part('epoch_millisecond', a), 0)::date AS a_ms
    ,to_timestamp_ntz(date_part('epoch_millisecond', a), 3)::date AS b_ms;
gives
A A_S A_MS B_MS
2016-06-06 00:00:00.000 2016-06-06 48399-06-06 2016-06-06
Which is to say: your date, turned into epoch milliseconds and parsed as seconds, gives exactly the 48399-06-06 you are seeing.
So you either throw away your milliseconds or alter how your dates are getting parsed.
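The same arithmetic can be checked outside Snowflake. A rough Python sketch (the epoch values are computed here, not taken from the answer; the average-year constant is an approximation):

```python
from datetime import datetime, timezone

# Epoch seconds for 2016-06-06 00:00:00 UTC.
epoch_s = int(datetime(2016, 6, 6, tzinfo=timezone.utc).timestamp())
epoch_ms = epoch_s * 1000

# Treating milliseconds as if they were seconds lands ~46,000 years out.
avg_year_s = 31_556_952  # average Gregorian year in seconds
wrong_year = 1970 + epoch_ms // avg_year_s
print(epoch_s, wrong_year)  # 1465171200 48399
```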
Upvotes: 0