Saucy Goat

Reputation: 1675

Python Pandas: Supporting 25 hours in datetime index

I want to use a date/time as an index for a dataframe in Pandas.

However, daylight saving time is not properly handled in the database, so the date/time values for the day on which daylight saving time ends span 25 hours and are represented like this:

2019102700
2019102701
...
2019102724

I am using the following code to convert those values to a DateTime object that I use as an index to a Pandas dataframe:

df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")

However, that gives an error:

ValueError: unconverted data remains: 4

Presumably this is because the to_datetime function does not expect the hour to be 24. Similarly, the day on which daylight saving time starts has only 23 hours.
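
A minimal reproduction of the failure, using the standard library's datetime.strptime (pandas accepts the same strptime-style format codes). %H only accepts hours 00 through 23, so for "24" it consumes the "2" and reports the "4" as unconverted. The two sample strings follow the value pattern shown above:

```python
from datetime import datetime

fmt = "%Y%m%d%H"

# Hour 23 is the largest value %H accepts
print(datetime.strptime("2019102723", fmt))

# Hour 24 is out of range: %H matches only the "2", leaving "4" behind
try:
    datetime.strptime("2019102724", fmt)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```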

One solution I thought of was storing the dates as strings, but that seems neither elegant nor efficient. Is there any way to solve the issue of handling daylight saving time when using to_datetime?

Upvotes: 1

Views: 1045

Answers (3)

Amir P

Reputation: 151

Take the first and the last index values, convert them to tz-aware datetimes, and generate a date_range between them; a tz-aware range handles 25-hour days correctly. Then assign the date_range as the index of your df:

start = pd.to_datetime(df.index[0]).tz_localize("Europe/Berlin")
end = pd.to_datetime(df.index[-1]).tz_localize("Europe/Berlin")
index_ = pd.date_range(start, end, freq="h")  # freq must match the data spacing (hourly in the question)
df = df.set_index(index_)
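
A minimal sketch of this idea for the hourly strings in the question, assuming the rows are complete (one per local hour, no gaps) and already sorted. The sample frame and its value column are made up for illustration. Rather than parsing the last value, which may itself be the unparseable hour 24, the sketch localizes only the first date and lets date_range count off len(df) hourly steps:

```python
import pandas as pd

# Hypothetical sample: one row per local hour on the day DST ends in Berlin
raw = [f"20191027{h:02d}" for h in range(25)]  # "2019102700" ... "2019102724"
df = pd.DataFrame({"date_time": raw, "value": range(25)})

# Parse only the date of the first row and make it tz-aware (local midnight)
start = pd.to_datetime(df["date_time"].iloc[0][:8], format="%Y%m%d").tz_localize("Europe/Berlin")

# A tz-aware hourly range steps through absolute time, so the repeated
# 02:00 hour appears twice (once at +02:00, once at +01:00)
index_ = pd.date_range(start, periods=len(df), freq="h")
df = df.set_index(index_)

print(df.index[:4])
print(df.index[-1])
```

If rows are missing, this positional assignment would silently shift the timestamps, so it is worth checking len(df) against the expected number of hours first.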

Upvotes: 1

FObersteiner

Reputation: 25564

If you know the time zone, here's a way to calculate UTC timestamps: parse only the date part, localize it to the time zone the data "belongs" to, and convert that to UTC. Then parse the hour part and add it as a timedelta, e.g.

import pandas as pd 

df = pd.DataFrame({'date_time_str': ['2019102722','2019102723','2019102724',
                                     '2019102800','2019102801','2019102802']})

df['date_time'] = (pd.to_datetime(df['date_time_str'].str[:-2], format='%Y%m%d')
                   .dt.tz_localize('Europe/Berlin')
                   .dt.tz_convert('UTC'))

# add the hour field as an offset from local midnight
df['date_time'] += pd.to_timedelta(df['date_time_str'].str[-2:].astype(int), unit='h')

# df['date_time']
# 0   2019-10-27 20:00:00+00:00
# 1   2019-10-27 21:00:00+00:00
# 2   2019-10-27 22:00:00+00:00
# 3   2019-10-27 23:00:00+00:00
# 4   2019-10-28 00:00:00+00:00
# 5   2019-10-28 01:00:00+00:00
# Name: date_time, dtype: datetime64[ns, UTC]
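
To get back to what the question asked for, the result can be used as the index and, if local wall-clock time is preferred, converted from UTC back to the original zone. A rough sketch under the same assumptions: the helper name to_utc_index and the sample values are hypothetical, the hour field is read as hours elapsed since local midnight, and local midnight must exist and be unambiguous in the zone (as it is for Europe/Berlin):

```python
import pandas as pd

def to_utc_index(date_time_str: pd.Series, tz: str) -> pd.DatetimeIndex:
    """Hypothetical helper: 'YYYYMMDDHH' strings -> tz-aware UTC index."""
    # Local midnight of each date, expressed in UTC
    midnight_utc = (pd.to_datetime(date_time_str.str[:-2], format="%Y%m%d")
                    .dt.tz_localize(tz)
                    .dt.tz_convert("UTC"))
    # Hour field as an offset from local midnight (0..24 on a 25-hour day)
    hours = pd.to_timedelta(date_time_str.str[-2:].astype(int), unit="h")
    return pd.DatetimeIndex(midnight_utc + hours)

# 25-hour day (DST ends) and 23-hour day (DST starts) in Europe/Berlin
autumn = pd.Series([f"20191027{h:02d}" for h in range(25)])
spring = pd.Series([f"20190331{h:02d}" for h in range(23)])

autumn_local = to_utc_index(autumn, "Europe/Berlin").tz_convert("Europe/Berlin")
spring_local = to_utc_index(spring, "Europe/Berlin").tz_convert("Europe/Berlin")

df = pd.DataFrame({"value": range(25)}, index=autumn_local)
print(df.iloc[1:5])
```

Both days stay within their own calendar date after conversion: the autumn index contains 02:00 twice with different UTC offsets, and the spring index skips 02:00 entirely.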

Upvotes: 1

hoomant

Reputation: 455

I'm not sure if it is the most elegant or efficient solution, but I would:

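mask = df.date_time.str[-2:] == '24'  # rows with the overflow hour (the 25th hour of the day)
# +100 bumps the day and -24 resets the hour to 00; this breaks on the last day of a month
df.loc[mask, 'date_time'] = (pd.to_numeric(df.date_time[mask]) + 100 - 24).apply(str)
df.index = pd.to_datetime(df["date_time"], format="%Y%m%d%H")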
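
A variant of the same rollover idea, sketched with timedelta arithmetic instead of integer arithmetic on the string, so that hour 24 on the last day of a month rolls into the next month correctly. This is not the code above, only an alternative under the assumption that hour 24 means 00:00 of the following day; the sample values are made up. The result is timezone-naive, so a rolled-over row can collide with a genuine 00 entry for the next day:

```python
import pandas as pd

# Hypothetical sample, including an hour-24 value on the last day of a month
df = pd.DataFrame({"date_time": ["2019102723", "2019102724", "2019103124"]})

# Parse the date and hour fields separately; adding the hours as a
# timedelta lets hour 24 carry over into the next day (or month)
dates = pd.to_datetime(df["date_time"].str[:8], format="%Y%m%d")
hours = pd.to_timedelta(df["date_time"].str[8:].astype(int), unit="h")
df.index = pd.DatetimeIndex(dates + hours)

print(df)
```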

Upvotes: 1
