Pruthvi Amin
Pruthvi Amin

Reputation: 85

Pandas datetime values messed up after saving df to excel and then reading back into a df

jan_21=[datetime(2021,1,1) + timedelta(hours=i) for i in range(5)]


jan_21
    
[datetime.datetime(2021, 1, 1, 0, 0),
 datetime.datetime(2021, 1, 1, 1, 0),
 datetime.datetime(2021, 1, 1, 2, 0),
 datetime.datetime(2021, 1, 1, 3, 0),
 datetime.datetime(2021, 1, 1, 4, 0)]

prices = np.random.randint(1,100,size=(5,))

prices

[46 23 13 26 52]

df = pd.DataFrame({'datetime':jan_21, 'price':prices})

df

             datetime  price
0 2021-01-01 00:00:00     83
1 2021-01-01 01:00:00     60
2 2021-01-01 02:00:00     29
3 2021-01-01 03:00:00     97
4 2021-01-01 04:00:00     67

All good so far, this is how I expected the dataframe and datetime values to be displayed. The problem comes when I save the dataframe to an excel file and then read it back into a dataframe, the datetime values get messed up.

df.to_excel('price_data.xlsx', index=False)

new_df = pd.read_excel('price_data.xlsx')

new_df

                      datetime  price
0   2021-01-01 00:00:00.000000  83
1   2021-01-01 00:59:59.999999  60
2   2021-01-01 02:00:00.000001  29
3   2021-01-01 03:00:00.000000  97
4   2021-01-01 03:59:59.999999  67

I'd like df == new_df to evaluate to True

Upvotes: 2

Views: 1087

Answers (2)

Georgy Kopshteyn
Georgy Kopshteyn

Reputation: 763

Against the backdrop of the likely cause of the issue (see sophros' answer), what you could do to - superficiously - circumvent the problem is converting the cells of df["datetime"] to strings before producing the excel file and then converting the strings to datetime again, after new_df has been created:

df["datetime"] = df["datetime"].dt.strftime("%m/%d/%Y, %H:%M:%S")
df.to_excel('price_data.xlsx', index=False)

new_df = pd.read_excel('price_data.xlsx')
new_df["datetime"] = pd.to_datetime(new_df["datetime"], format="%m/%d/%Y, %H:%M:%S")

Upvotes: 2

sophros
sophros

Reputation: 16660

The reason for the difference in time part of 00:59:59.999999 and 02:00:00.000001 and 03:59:59.999999 is most likely related to a slightly different binary representation of date/time types in Excel and Python or pandas.

The time is frequently stored as a float but the difference is when is the 0-th time (e.g. year 1 AC or 1970 - as in Linux; good explanation here). Therefore, the conversion may loose some least significant parts of the date/time and there is not much you can do about it but round it up or using approximate comparisons as with any float.

Upvotes: 0

Related Questions