Pandas read_excel function is not able to import date and time fields properly

Question

I'm reading the excel file that has two separate fields for date and time. This is how they look like when I open the file in excel:

Local Date  Local Time
31/08/2021  14:09:53
31/08/2021  14:11:52
31/08/2021  14:11:02
31/08/2021  14:09:25

Now, I'm using the following code:

df_tmp = pd.read_excel('1-5.xlsb', parse_dates=[['Local Date','Local Time']])

And this is what I'm getting:

Local Date_Local Time
31/08/2021 0.5901967592592593
31/08/2021 0.5915740740740741

Moreover, when I explore the content

df_tmp['Local Date_Local Time'].str.len().unique()

I can see the whole bunch of cases:

array([29, 28, 20, 18, 19, 16, 15, 17, 14, 24, 23, 11, 10, 13, 12, 27, 26,
       25, 33, 32, 31, 30,  9], dtype=int64)

So that the shortest cases are represented by a really weird dates:

df_tmp[df_tmp['Local Date_Local Time'].str.len()==13].head(2)

Local Date_Local Time   
44205 0.90375   
44205 0.92625

In excel, same values look like this:

1/9/2021    21:41:24
1/9/2021    21:21:35

Longer ones (len>13) look like normal dates, just time is fractional.

I know how to deal with each case, so that 44205 should be converted to date with base 1899-12-31; 0.90375 is also fraction of the single day, but my question is why is in excel everything is normal without any efforts and here I have to invent tricky converters? Maybe there is a simpler approach and I'm missing something?

Update (10/01/2021): OK, in order to simplify and localise the problem I did the following:

df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':str})
df_tmp['Local Date'].unique()
array(['31/08/2021', '44205', '30/08/2021', '28/08/2021', '29/08/2021',
       '44236', '44264', '44295', '44325'], dtype=object)

If I save the file in excel as 'xlsx', then the result will be different:

array(['31/08/2021', '2021-01-09 00:00:00', '30/08/2021', '28/08/2021',
       '29/08/2021', '2021-02-09 00:00:00', '2021-03-09 00:00:00',
       '2021-04-09 00:00:00', '2021-05-09 00:00:00'], dtype=object)

Note, that dates that were converted from int, are having mixed day and month - there supposed to be an interval of dates from Aug, 28 to Sep, 5 - so it seems it is a problem of the original data in excel. Anyway, to deal with this situation I have done the following:

def date_parser(x):
    if type(x) == int:
        y = pd.to_datetime(pd.to_datetime(x, unit='D', origin='1899-12-30').strftime(format='%Y-%d-%m'))
    else:
        y = pd.to_datetime(x, dayfirst=True)
    return y
df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':date_parser})

Then, adding time is not a big deal. Thanks for your help!

Pandas read_excel function is not able to import date and time fields properly

Answers (1)

Related Questions