VZH
VZH

Reputation: 1

Pandas read_excel function is not able to import date and time fields properly

I'm reading the excel file that has two separate fields for date and time. This is how they look like when I open the file in excel:

Local Date  Local Time
31/08/2021  14:09:53
31/08/2021  14:11:52
31/08/2021  14:11:02
31/08/2021  14:09:25

Now, I'm using the following code:

df_tmp = pd.read_excel('1-5.xlsb', parse_dates=[['Local Date','Local Time']])

And this is what I'm getting:

Local Date_Local Time
31/08/2021 0.5901967592592593
31/08/2021 0.5915740740740741

Moreover, when I explore the content

df_tmp['Local Date_Local Time'].str.len().unique()

I can see the whole bunch of cases:

array([29, 28, 20, 18, 19, 16, 15, 17, 14, 24, 23, 11, 10, 13, 12, 27, 26,
       25, 33, 32, 31, 30,  9], dtype=int64)

So that the shortest cases are represented by a really weird dates:

df_tmp[df_tmp['Local Date_Local Time'].str.len()==13].head(2)

Local Date_Local Time   
44205 0.90375   
44205 0.92625

In excel, same values look like this:

1/9/2021    21:41:24
1/9/2021    21:21:35

Longer ones (len>13) look like normal dates, just time is fractional.

I know how to deal with each case, so that 44205 should be converted to date with base 1899-12-31; 0.90375 is also fraction of the single day, but my question is why is in excel everything is normal without any efforts and here I have to invent tricky converters? Maybe there is a simpler approach and I'm missing something?


Update (10/01/2021): OK, in order to simplify and localise the problem I did the following:

df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':str})
df_tmp['Local Date'].unique()
array(['31/08/2021', '44205', '30/08/2021', '28/08/2021', '29/08/2021',
       '44236', '44264', '44295', '44325'], dtype=object)

If I save the file in excel as 'xlsx', then the result will be different:

array(['31/08/2021', '2021-01-09 00:00:00', '30/08/2021', '28/08/2021',
       '29/08/2021', '2021-02-09 00:00:00', '2021-03-09 00:00:00',
       '2021-04-09 00:00:00', '2021-05-09 00:00:00'], dtype=object)

Note, that dates that were converted from int, are having mixed day and month - there supposed to be an interval of dates from Aug, 28 to Sep, 5 - so it seems it is a problem of the original data in excel. Anyway, to deal with this situation I have done the following:

def date_parser(x):
    if type(x) == int:
        y = pd.to_datetime(pd.to_datetime(x, unit='D', origin='1899-12-30').strftime(format='%Y-%d-%m'))
    else:
        y = pd.to_datetime(x, dayfirst=True)
    return y
df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':date_parser})

Then, adding time is not a big deal. Thanks for your help!

Upvotes: 0

Views: 1272

Answers (1)

Jonathan Leon
Jonathan Leon

Reputation: 5648

Your dates are day first, which is why some of your dates look ok, while others don't. For the date you can do this:

date_parser = lambda x: datetime.datetime.strptime(x, "%d/%m/%Y")
pd.read_excel('test.xlsb', parse_dates=['Local Date'], date_parser=date_parser)

Note the date format changes once read into pandas

  Local Date  Local Time
0 2021-08-31    0.590197
1 2021-08-31    0.591574
2 2021-08-31    0.590995
3 2021-08-31    0.589873

Another option if you want to create one datetime field you can do this. Note, there's a deprecation warning but I didn't have time to correct it...

date_parser = lambda x, y: datetime.datetime.strptime(x, "%d/%m/%Y") + pd.to_timedelta(y, unit='D')
pd.read_excel('test.xlsb', parse_dates=[['Local Date','Local Time']], date_parser=date_parser)

       Local Date_Local Time
0 2021-08-31 14:09:52.999977
1 2021-08-31 14:11:51.999993
2 2021-08-31 14:11:01.999968
3 2021-08-31 14:09:24.999984

Upvotes: 1

Related Questions