Reputation: 1
I'm reading the excel file that has two separate fields for date and time. This is how they look like when I open the file in excel:
Local Date Local Time
31/08/2021 14:09:53
31/08/2021 14:11:52
31/08/2021 14:11:02
31/08/2021 14:09:25
Now, I'm using the following code:
df_tmp = pd.read_excel('1-5.xlsb', parse_dates=[['Local Date','Local Time']])
And this is what I'm getting:
Local Date_Local Time
31/08/2021 0.5901967592592593
31/08/2021 0.5915740740740741
Moreover, when I explore the content
df_tmp['Local Date_Local Time'].str.len().unique()
I can see the whole bunch of cases:
array([29, 28, 20, 18, 19, 16, 15, 17, 14, 24, 23, 11, 10, 13, 12, 27, 26,
25, 33, 32, 31, 30, 9], dtype=int64)
So that the shortest cases are represented by a really weird dates:
df_tmp[df_tmp['Local Date_Local Time'].str.len()==13].head(2)
Local Date_Local Time
44205 0.90375
44205 0.92625
In excel, same values look like this:
1/9/2021 21:41:24
1/9/2021 21:21:35
Longer ones (len>13) look like normal dates, just time is fractional.
I know how to deal with each case, so that 44205 should be converted to date with base 1899-12-31; 0.90375 is also fraction of the single day, but my question is why is in excel everything is normal without any efforts and here I have to invent tricky converters? Maybe there is a simpler approach and I'm missing something?
Update (10/01/2021): OK, in order to simplify and localise the problem I did the following:
df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':str})
df_tmp['Local Date'].unique()
array(['31/08/2021', '44205', '30/08/2021', '28/08/2021', '29/08/2021',
'44236', '44264', '44295', '44325'], dtype=object)
If I save the file in excel as 'xlsx', then the result will be different:
array(['31/08/2021', '2021-01-09 00:00:00', '30/08/2021', '28/08/2021',
'29/08/2021', '2021-02-09 00:00:00', '2021-03-09 00:00:00',
'2021-04-09 00:00:00', '2021-05-09 00:00:00'], dtype=object)
Note, that dates that were converted from int, are having mixed day and month - there supposed to be an interval of dates from Aug, 28 to Sep, 5 - so it seems it is a problem of the original data in excel. Anyway, to deal with this situation I have done the following:
def date_parser(x):
if type(x) == int:
y = pd.to_datetime(pd.to_datetime(x, unit='D', origin='1899-12-30').strftime(format='%Y-%d-%m'))
else:
y = pd.to_datetime(x, dayfirst=True)
return y
df_tmp = pd.read_excel('1-5.xlsb', converters={'Local Date':date_parser})
Then, adding time is not a big deal. Thanks for your help!
Upvotes: 0
Views: 1272
Reputation: 5648
Your dates are day first, which is why some of your dates look ok, while others don't. For the date you can do this:
date_parser = lambda x: datetime.datetime.strptime(x, "%d/%m/%Y")
pd.read_excel('test.xlsb', parse_dates=['Local Date'], date_parser=date_parser)
Note the date format changes once read into pandas
Local Date Local Time
0 2021-08-31 0.590197
1 2021-08-31 0.591574
2 2021-08-31 0.590995
3 2021-08-31 0.589873
Another option if you want to create one datetime field you can do this. Note, there's a deprecation warning but I didn't have time to correct it...
date_parser = lambda x, y: datetime.datetime.strptime(x, "%d/%m/%Y") + pd.to_timedelta(y, unit='D')
pd.read_excel('test.xlsb', parse_dates=[['Local Date','Local Time']], date_parser=date_parser)
Local Date_Local Time
0 2021-08-31 14:09:52.999977
1 2021-08-31 14:11:51.999993
2 2021-08-31 14:11:01.999968
3 2021-08-31 14:09:24.999984
Upvotes: 1