Reputation: 3961
I'm trying to find the nearest datetime index of my table. I'm using this post as a starting point, and am using this MWE:
import os
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta
df = pd.DataFrame()
df['datetime'] = pd.date_range(start='2019-01-01', end='2021-01-01', freq='H')
df = df.set_index('datetime')
df['year'] = pd.DatetimeIndex(df.index).year
df['mnth'] = pd.DatetimeIndex(df.index).month
df['day'] = pd.DatetimeIndex(df.index).day
df['dow'] = pd.DatetimeIndex(df.index).dayofweek # Mon=0, ..., Sun=6
df['hour'] = pd.DatetimeIndex(df.index).hour
years = df.year.unique()
idxlist = []
for y in years:
idx1 = df.loc[((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2))]
#idx1 = df.iloc[df.get_loc(((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2)), method='nearest')]
idxlist.append(idx1)
Edit based on Michael Delgado comments:
I have several years' worth of daily data, including for the correct days (first Sunday of April in every year).
Even though this works with my MWE, my actual dataset contains missing data and there may not be data for exactly 2am. Data is spaced roughly 20-35min intervals, so the closest value should be less than 15min away from the 2AM target.
I want to find the nearest datetime to 2am in the first Sunday in April. This is for every year in the DataFrame, but I'm not sure how to do this.
Upvotes: 0
Views: 429
Reputation: 15432
Based on your comments, it seems that you can rely on always having data within an hour of your desired time (1st Sunday of April) in each year. In this case, you can take a simpler approach.
Using an example dataset with variation in the times:
In [4]: df = pd.DataFrame(
...: ...: {'val': np.arange(24*366*10)},
...: ...: index=(
...: ...: pd.date_range('2010-01-01', periods=24*366*10, freq='H')
...: ...: + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
...: ...: ),
...: ...: )
In [5]: df
Out[5]:
val
2010-01-01 00:14:00 0
2010-01-01 01:20:00 1
2010-01-01 01:46:00 2
2010-01-01 03:20:00 3
2010-01-01 03:51:00 4
... ...
2020-01-08 18:48:00 87835
2020-01-08 19:46:00 87836
2020-01-08 21:07:00 87837
2020-01-08 22:06:00 87838
2020-01-08 23:11:00 87839
[87840 rows x 1 columns]
We can filter based on times rounded to the nearest 2 hours:
within_an_hour = df[
(df.index.month==4)
& (df.index.day<=7)
& (df.index.day_of_week == 6)
& (df.index.round('2H').hour == 2)
]
We can then select the closest indices by taking the minimum absolute difference to the 2-hour rounded value for each year:
In [15]: closest_indices = (
...: within_an_hour
...: .groupby(within_an_hour.index.year)
...: .apply(
...: lambda x: x.index.values[np.argmin(abs(x.index - x.index.round('2H')))]
...: )
...: )
In [16]: closest_indices
Out[16]:
2010 2010-04-04 02:17:00
2011 2011-04-03 02:22:00
2012 2012-04-01 01:49:00
2013 2013-04-07 01:39:00
2014 2014-04-06 02:01:00
2015 2015-04-05 01:58:00
2016 2016-04-03 02:12:00
2017 2017-04-02 01:54:00
2018 2018-04-01 02:22:00
2019 2019-04-07 02:13:00
dtype: datetime64[ns]
Upvotes: 1
Reputation: 15432
This is a bit of a challenge, just because "the first sunday of April in any year" takes a few steps to compute. You could approach this in a few ways, but I'll approach this by first computing the nearest target date in the year of the target date, as well as the following year (because April is always in the first part of the year, so the date will never be closer to the April in the preceding year), then finding the minimum absolute difference to either of the targets.
First step, I'll expand on your MWE with some random offsets (+/- 30 mins) and a longer time series. I also added a value column so the df displays as a frame:
In [26]: df = pd.DataFrame(
...: {'val': np.arange(24*366*10)},
...: index=(
...: pd.date_range('2010-01-01', periods=24*366*10, freq='H')
...: + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
...: ),
...: )
In [27]: df
Out[27]:
val
2010-01-01 00:29:00 0
2010-01-01 01:09:00 1
2010-01-01 01:43:00 2
2010-01-01 03:14:00 3
2010-01-01 03:54:00 4
... ...
2020-01-08 18:31:00 87835
2020-01-08 20:21:00 87836
2020-01-08 20:54:00 87837
2020-01-08 21:47:00 87838
2020-01-08 23:11:00 87839
Next, I find the date of the first Sunday in April (at 2 AM) for the year of each row:
In [28]: apr1 = pd.to_datetime({'year': df.index.year, 'month': 4, 'day': 1, 'hour': 2})
In [29]: apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')
In [30]: apr_first_sun
Out[30]:
0 2010-04-04 02:00:00
1 2010-04-04 02:00:00
2 2010-04-04 02:00:00
3 2010-04-04 02:00:00
4 2010-04-04 02:00:00
...
87835 2020-04-05 02:00:00
87836 2020-04-05 02:00:00
87837 2020-04-05 02:00:00
87838 2020-04-05 02:00:00
87839 2020-04-05 02:00:00
Length: 87840, dtype: datetime64[ns]
In [31]: apr1 = pd.to_datetime({'year': df.index.year + 1, 'month': 4, 'day': 1, 'hour': 2})
In [32]: next_apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')
Next, find the closer absolute difference:
In [36]: nearer_abs_diff = np.minimum(abs(df.index - apr_first_sun.values), abs(df.index - next_apr_first_sun.values))
In [37]: nearer_abs_diff
Out[37]:
TimedeltaIndex(['93 days 01:31:00', '93 days 00:51:00', '93 days 00:17:00',
'92 days 22:46:00', '92 days 22:06:00', '92 days 20:54:00',
'92 days 20:23:00', '92 days 19:25:00', '92 days 18:12:00',
'92 days 16:48:00',
...
'87 days 12:19:00', '87 days 11:12:00', '87 days 09:36:00',
'87 days 08:31:00', '87 days 07:36:00', '87 days 07:29:00',
'87 days 05:39:00', '87 days 05:06:00', '87 days 04:13:00',
'87 days 02:49:00'],
dtype='timedelta64[ns]', length=87840, freq=None
Finally, find the positional index of the minimum absolute difference and use that to index into the dataframe:
In [38]: idx = np.argmin(nearer_abs_diff)
In [39]: df.iloc[idx]
Out[39]:
val 37346
Name: 2014-04-06 02:14:00, dtype: int64
Upvotes: 1