Pandas find nearest datetime index with conditional arguments

Question

I'm trying to find the nearest datetime index of my table. I'm using this post as a starting point, and am using this MWE:

import os
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta

df = pd.DataFrame() 
df['datetime'] = pd.date_range(start='2019-01-01', end='2021-01-01', freq='H')
df = df.set_index('datetime')

df['year'] = pd.DatetimeIndex(df.index).year
df['mnth'] = pd.DatetimeIndex(df.index).month
df['day'] = pd.DatetimeIndex(df.index).day
df['dow'] = pd.DatetimeIndex(df.index).dayofweek # Mon=0, ..., Sun=6
df['hour'] = pd.DatetimeIndex(df.index).hour

years = df.year.unique()

idxlist = []

for y in years:
    idx1 = df.loc[((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2))]
    #idx1 = df.iloc[df.get_loc(((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2)), method='nearest')]
    idxlist.append(idx1)

Edit based on Michael Delgado comments:

I have several years' worth of daily data, including for the correct days (first Sunday of April in every year).

This works with my MWE, but my actual dataset contains missing data, and there may not be a row at exactly 2 AM. The data is spaced at roughly 20-35 min intervals, so the closest value should be less than 15 min away from the 2 AM target.

I want to find the nearest datetime to 2 AM on the first Sunday in April, for every year in the DataFrame, but I'm not sure how to do this.

Michael Delgado · Accepted Answer

Based on your comments, it seems that you can rely on always having data within an hour of your desired time (2 AM on the first Sunday of April) in each year. In this case, you can take a simpler approach.

Using an example dataset with variation in the times:

In [4]: df = pd.DataFrame(
   ...:     {'val': np.arange(24*366*10)},
   ...:     index=(
   ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
   ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
   ...:     ),
   ...: )

In [5]: df
Out[5]:
                       val
2010-01-01 00:14:00      0
2010-01-01 01:20:00      1
2010-01-01 01:46:00      2
2010-01-01 03:20:00      3
2010-01-01 03:51:00      4
...                    ...
2020-01-08 18:48:00  87835
2020-01-08 19:46:00  87836
2020-01-08 21:07:00  87837
2020-01-08 22:06:00  87838
2020-01-08 23:11:00  87839

[87840 rows x 1 columns]

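We can keep only the rows that fall on the first Sunday of April and whose timestamp rounds to 02:00 when rounded to the nearest 2 hours, i.e. rows within an hour of 2 AM: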

within_an_hour = df[
    (df.index.month==4)
    & (df.index.day<=7)
    & (df.index.day_of_week == 6)
    & (df.index.round('2H').hour == 2)
]
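To see which times the rounding condition in the filter above lets through, here is a small sketch on a handful of made-up timestamps (the values are illustrative only, not from the question's data). It uses the lowercase `'2h'` alias, which I believe is accepted by both older and newer pandas versions; the uppercase `'H'` alias is deprecated in recent releases.

```python
import pandas as pd

# Hypothetical timestamps around the 2 AM target on one first Sunday of April.
times = pd.DatetimeIndex([
    "2019-04-07 00:59",
    "2019-04-07 01:01",
    "2019-04-07 02:00",
    "2019-04-07 02:59",
    "2019-04-07 03:01",
])

# Rounding to 2-hour resolution snaps each timestamp to 00:00, 02:00, 04:00, ...
rounded = times.round("2h")

# True only where the rounded timestamp lands on 02:00,
# i.e. the original time lies between 01:00 and 03:00.
in_window = rounded.hour == 2

print(pd.Series(in_window, index=times))
```

So the filter is a two-hour window centred on 2 AM. If the nearest row must be within 15 minutes, as in the question, an extra check on the final difference would still be needed.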

We can then select the closest indices by taking the minimum absolute difference to the 2-hour rounded value for each year:

In [15]: closest_indices = (
    ...:     within_an_hour
    ...:     .groupby(within_an_hour.index.year)
    ...:     .apply(
    ...:         lambda x: x.index.values[np.argmin(abs(x.index - x.index.round('2H')))]
    ...:     )
    ...: )

In [16]: closest_indices
Out[16]:
2010   2010-04-04 02:17:00
2011   2011-04-03 02:22:00
2012   2012-04-01 01:49:00
2013   2013-04-07 01:39:00
2014   2014-04-06 02:01:00
2015   2015-04-05 01:58:00
2016   2016-04-03 02:12:00
2017   2017-04-02 01:54:00
2018   2018-04-01 02:22:00
2019   2019-04-07 02:13:00
dtype: datetime64[ns]
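If you want to enforce the 15-minute limit from the question explicitly, an alternative is to build the 2 AM target for each year and look it up with `Index.get_indexer(..., method='nearest', tolerance=...)`. The following is only a sketch of that idea, not the approach used above: the helper names `first_sunday_2am` and `nearest_to_targets` are made up for illustration, and it assumes the index can be sorted and de-duplicated, since nearest lookups need a monotonic, unique index.

```python
import pandas as pd


def first_sunday_2am(year: int) -> pd.Timestamp:
    """Return 02:00 on the first Sunday of April for the given year."""
    april_first = pd.Timestamp(year=year, month=4, day=1)
    # dayofweek is Mon=0 ... Sun=6, so this is the number of days until Sunday.
    days_to_sunday = (6 - april_first.dayofweek) % 7
    return april_first + pd.Timedelta(days=days_to_sunday, hours=2)


def nearest_to_targets(df: pd.DataFrame, tolerance: str = "15min") -> pd.Series:
    """Map each year to the index value nearest its 2 AM target.

    Years with no row within `tolerance` of the target are left out.
    """
    # get_indexer with method= requires a sorted index without duplicates.
    idx = df.index.sort_values()
    idx = idx[~idx.duplicated()]

    years = idx.year.unique()
    targets = pd.DatetimeIndex([first_sunday_2am(y) for y in years])

    # Positions of the nearest index values; -1 where nothing is close enough.
    pos = idx.get_indexer(targets, method="nearest", tolerance=pd.Timedelta(tolerance))
    found = pos >= 0
    return pd.Series(idx[pos[found]], index=years[found])


# Small hand-made example (hypothetical data).
example = pd.DataFrame(
    {"val": range(6)},
    index=pd.DatetimeIndex([
        "2019-04-07 01:40", "2019-04-07 02:10", "2019-04-07 02:35",
        "2020-04-05 01:55", "2020-04-05 02:30", "2021-01-01 00:00",
    ]),
)
result = nearest_to_targets(example)
print(result)
```

The rows themselves can then be pulled out with `df.loc[result]`. Unlike the groupby approach, a year without any row inside the tolerance is simply dropped from the result, so it is worth checking that every expected year is present.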
