ASturt

Reputation: 363

Snowflake write_pandas is not inserting dates correctly

I have a pandas dataframe named "df" that I'm creating from the result of a SQL query against a Netezza database. I'm working in a Jupyter notebook. The dataframe has two rows, and two of the columns (CREATEDDATE and STAGEDATE) contain datetime values. When I run print(df), the result looks like this:

                   ID ISDELETED            PARENTID         CREATEDBYID  \
0  017o000003tQRftAAG     false  a0no000000Hrv1IAAR  005o0000001w8wgAAA   
1  017o000003jl52cAAA     false  a0no000000GszDUAAZ  005o0000001w2pTAAQ   

          CREATEDDATE    FIELD OLDVALUE NEWVALUE  STAGEDATE  
0 2015-07-30 14:51:41  created     None     None 2016-06-06  
1 2015-07-16 14:48:37  created     None     None 2016-06-06

If I run print(df.dtypes), the result is this:

ID                     object
ISDELETED              object
PARENTID               object
CREATEDBYID            object
CREATEDDATE    datetime64[ns]
FIELD                  object
OLDVALUE               object
NEWVALUE               object
STAGEDATE      datetime64[ns]
dtype: object

So, as far as I can tell, my datetime columns are correctly formatted to write to Snowflake using write_pandas(). However, after I do that, the dates are wildly different in Snowflake.

The STAGEDATE value of 2016-06-06, for example, is now 48399-06-06. Does anyone know how to fix this? I'm doing my pull from Netezza using pyodbc and using df = cs.execute() to populate the dataframe. I am importing and using snowflake.connector for my connection to Snowflake and am running the following to get write_pandas:

from snowflake.connector.pandas_tools import write_pandas
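
For reference, a minimal sketch of my round trip looks roughly like this (the DSN, credentials, and table names are placeholders, and pd.read_sql stands in for my cursor-based pull):

import pandas as pd
import pyodbc
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Pull from Netezza into a dataframe (DSN is a placeholder)
nz_conn = pyodbc.connect("DSN=NETEZZA_DSN")
df = pd.read_sql("SELECT * FROM SOME_TABLE", nz_conn)

# Connect to Snowflake and write the dataframe (credentials are placeholders)
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="MY_SCHEMA",
)
write_pandas(conn, df, "MY_TABLE")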

Upvotes: 15

Views: 17327

Answers (6)

Salomon Davia

Reputation: 61

The solution presented by Lorenzo sufficed for local development but failed when I containerized the script. I assumed a Python library was missing, but found a better solution offered by Snowflake.

The problem was fixed by updating to snowflake-connector-python==3.4.0 and passing use_logical_type=True to write_pandas, which inserts the pandas datetime columns as timestamps with no time zone (NTZ). The parameter maps to the underlying COPY INTO file format option for the Parquet files the connector stages.

success, nchunks, nrows, output = write_pandas(conn, df,
                                               table_name=df.name,
                                               auto_create_table=True,
                                               overwrite=True,
                                               chunk_size=150000,
                                               compression='gzip',
                                               on_error="ABORT_STATEMENT",
                                               parallel=4,
                                               quote_identifiers=True,
                                               use_logical_type=True)

Upvotes: 1

gkaur

Reputation: 71

The fix_date_cols(df, tz='UTC') solution of localizing the datetime columns to UTC did not work for me.

However, write_pandas() now issues a UserWarning telling you to pass the use_logical_type=True parameter if your dataframe has datetime columns. And that worked!

write_pandas(conn=conn, df=df_clean, table_name='VENTITY_STG', use_logical_type=True)

Upvotes: 7

N1234
N1234

Reputation: 588

I'm hoping this will help others who are in a similar position to the one I was in.

My data frame column had a data type of date and I kept getting this error: Python Date Upload Error - Data Type Conversion

Eventually I found the solution was to change the date to an ISO 8601 datetime format.

dataDF["DATE_COL"] = dataDF["DATE_COL"].astype('datetime64')
dataDF["DATE_COL"] = dataDF["DATE_COL"].dt.strftime('%Y-%m-%d') #Note full format can be strftime('%Y-%m-%dT%H:%M:%SZ')

Hope this helps others as I spent AGES trying to figure this out!

Upvotes: 9

Greg Jessup

Reputation: 1

The solution from @Lorenzo Vitali works well, and I've added it to my Snowflake helper class. You'll want to remember to return df; while this is a “partial copy”, it fixes the above issue.

def fix_date_cols(df, tz='UTC'):
    # Localize every naive datetime64[ns] column to the given time zone
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)
    return df
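
Usage then looks like this (the table name is just an example):

df = fix_date_cols(df)
success, nchunks, nrows, output = write_pandas(conn, df, 'MY_TABLE')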

Upvotes: 0

Lorenzo Vitali

Reputation: 340

I found a solution with the code that pday wrote here. This function automatically adds a time zone to the datetime columns (the default in the function is UTC).

def fix_date_cols(df, tz='UTC'):
    cols = df.select_dtypes(include=['datetime64[ns]']).columns
    for col in cols:
        df[col] = df[col].dt.tz_localize(tz)

So I would suggest applying this function before passing the dataframe to write_pandas.

From what I understand, the issue is that the datetime values are misinterpreted because they are not well-defined enough. Adding time zone information forces the elements to be interpreted as datetimes.
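
You can check the effect on the dtypes directly; here is a minimal sketch (the column name is just an example):

import pandas as pd

df = pd.DataFrame({"STAGEDATE": pd.to_datetime(["2016-06-06"])})
print(df.dtypes)   # STAGEDATE    datetime64[ns]

fix_date_cols(df)  # localizes the column to UTC in place
print(df.dtypes)   # STAGEDATE    datetime64[ns, UTC]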

But I really recommend reading this conversation, where it is explained well; it really helped me.

Upvotes: 11

Simeon Pilgrim

Reputation: 26043

SELECT '2016-06-06'::timestamp as a
    ,to_timestamp_ntz(date_part('epoch_second', a),0)::date as a_s
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),0)::date as a_ms
    ,to_timestamp_ntz(date_part('epoch_millisecond', a),3)::date as b_ms;

gives

A                        A_S          A_MS         B_MS
2016-06-06 00:00:00.000  2016-06-06   48399-06-06  2016-06-06

which is to say, your date, turned into epoch milliseconds and parsed as seconds, gives the 48399-06-06 you are seeing.

so you either throw away your milliseconds, or alter how your dates are getting parsed.
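
A quick back-of-the-envelope check in Python shows the magnitude (the numbers are exact, the year arithmetic is approximate):

from datetime import datetime, timezone

d = datetime(2016, 6, 6, tzinfo=timezone.utc)
seconds = int(d.timestamp())  # 1465171200
millis = seconds * 1000       # 1465171200000

# Reading the millisecond count as if it were seconds puts you
# 1465171200000 / 31557600 (seconds per Julian year) ~= 46429 years
# past 1970 -- right around the year 48399 seen above.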

Upvotes: 0
