Medulla Oblongata
Medulla Oblongata

Reputation: 3961

Pandas find nearest datetime index with conditional arguments

I'm trying to find the nearest datetime index of my table. I'm using this post as a starting point, and am using this MWE:

import os
import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta

df = pd.DataFrame() 
df['datetime'] = pd.date_range(start='2019-01-01', end='2021-01-01', freq='H')
df = df.set_index('datetime')

df['year'] = pd.DatetimeIndex(df.index).year
df['mnth'] = pd.DatetimeIndex(df.index).month
df['day'] = pd.DatetimeIndex(df.index).day
df['dow'] = pd.DatetimeIndex(df.index).dayofweek # Mon=0, ..., Sun=6
df['hour'] = pd.DatetimeIndex(df.index).hour

years = df.year.unique()

idxlist = []

for y in years:
    idx1 = df.loc[((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2))]
    #idx1 = df.iloc[df.get_loc(((df.year==y) & (df.mnth==4) & (df.day<=7) & (df.dow==6) & (df.hour==2)), method='nearest')]
    idxlist.append(idx1)

Edit based on Michael Delgado comments:

I have several years' worth of daily data, including for the correct days (first Sunday of April in every year).

Even though this works with my MWE, my actual dataset contains missing data and there may not be data for exactly 2am. Data is spaced roughly 20-35min intervals, so the closest value should be less than 15min away from the 2AM target.

I want to find the nearest datetime to 2am in the first Sunday in April. This is for every year in the DataFrame, but I'm not sure how to do this.

Upvotes: 0

Views: 429

Answers (2)

Michael Delgado
Michael Delgado

Reputation: 15432

Based on your comments, it seems that you can rely on always having data within an hour of your desired time (1st Sunday of April) in each year. In this case, you can take a simpler approach.

Using an example dataset with variation in the times:

In [4]: df = pd.DataFrame(
   ...:     ...:     {'val': np.arange(24*366*10)},
   ...:     ...:     index=(
   ...:     ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
   ...:     ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
   ...:     ...:     ),
   ...:     ...: )

In [5]: df
Out[5]:
                       val
2010-01-01 00:14:00      0
2010-01-01 01:20:00      1
2010-01-01 01:46:00      2
2010-01-01 03:20:00      3
2010-01-01 03:51:00      4
...                    ...
2020-01-08 18:48:00  87835
2020-01-08 19:46:00  87836
2020-01-08 21:07:00  87837
2020-01-08 22:06:00  87838
2020-01-08 23:11:00  87839

[87840 rows x 1 columns]

We can filter based on times rounded to the nearest 2 hours:

within_an_hour = df[
    (df.index.month==4)
    & (df.index.day<=7)
    & (df.index.day_of_week == 6)
    & (df.index.round('2H').hour == 2)
]

We can then select the closest indices by taking the minimum absolute difference to the 2-hour rounded value for each year:

In [15]: closest_indices = (
    ...:     within_an_hour
    ...:     .groupby(within_an_hour.index.year)
    ...:     .apply(
    ...:         lambda x: x.index.values[np.argmin(abs(x.index - x.index.round('2H')))]
    ...:     )
    ...: )

In [16]: closest_indices
Out[16]:
2010   2010-04-04 02:17:00
2011   2011-04-03 02:22:00
2012   2012-04-01 01:49:00
2013   2013-04-07 01:39:00
2014   2014-04-06 02:01:00
2015   2015-04-05 01:58:00
2016   2016-04-03 02:12:00
2017   2017-04-02 01:54:00
2018   2018-04-01 02:22:00
2019   2019-04-07 02:13:00
dtype: datetime64[ns]

Upvotes: 1

Michael Delgado
Michael Delgado

Reputation: 15432

This is a bit of a challenge, just because "the first sunday of April in any year" takes a few steps to compute. You could approach this in a few ways, but I'll approach this by first computing the nearest target date in the year of the target date, as well as the following year (because April is always in the first part of the year, so the date will never be closer to the April in the preceding year), then finding the minimum absolute difference to either of the targets.

First step, I'll expand on your MWE with some random offsets (+/- 30 mins) and a longer time series. I also added a value column so the df displays as a frame:

In [26]: df = pd.DataFrame(
    ...:     {'val': np.arange(24*366*10)},
    ...:     index=(
    ...:         pd.date_range('2010-01-01', periods=24*366*10, freq='H')
    ...:         + pd.to_timedelta(np.random.randint(-30, 30, size=(24*366*10)), unit='minutes')
    ...:     ),
    ...: )

In [27]: df
Out[27]:
                       val
2010-01-01 00:29:00      0
2010-01-01 01:09:00      1
2010-01-01 01:43:00      2
2010-01-01 03:14:00      3
2010-01-01 03:54:00      4
...                    ...
2020-01-08 18:31:00  87835
2020-01-08 20:21:00  87836
2020-01-08 20:54:00  87837
2020-01-08 21:47:00  87838
2020-01-08 23:11:00  87839

Next, I find the date of the first Sunday in April (at 2 AM) for the year of each row:

In [28]: apr1 = pd.to_datetime({'year': df.index.year, 'month': 4, 'day': 1, 'hour': 2})

In [29]: apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')

In [30]: apr_first_sun
Out[30]:
0       2010-04-04 02:00:00
1       2010-04-04 02:00:00
2       2010-04-04 02:00:00
3       2010-04-04 02:00:00
4       2010-04-04 02:00:00
                ...
87835   2020-04-05 02:00:00
87836   2020-04-05 02:00:00
87837   2020-04-05 02:00:00
87838   2020-04-05 02:00:00
87839   2020-04-05 02:00:00
Length: 87840, dtype: datetime64[ns]

In [31]: apr1 = pd.to_datetime({'year': df.index.year + 1, 'month': 4, 'day': 1, 'hour': 2})

In [32]: next_apr_first_sun = apr1 + pd.to_timedelta(6 - apr1.dt.weekday, unit='day')

Next, find the closer absolute difference:

In [36]: nearer_abs_diff = np.minimum(abs(df.index - apr_first_sun.values), abs(df.index - next_apr_first_sun.values))

In [37]: nearer_abs_diff
Out[37]:
TimedeltaIndex(['93 days 01:31:00', '93 days 00:51:00', '93 days 00:17:00',
                '92 days 22:46:00', '92 days 22:06:00', '92 days 20:54:00',
                '92 days 20:23:00', '92 days 19:25:00', '92 days 18:12:00',
                '92 days 16:48:00',
                ...
                '87 days 12:19:00', '87 days 11:12:00', '87 days 09:36:00',
                '87 days 08:31:00', '87 days 07:36:00', '87 days 07:29:00',
                '87 days 05:39:00', '87 days 05:06:00', '87 days 04:13:00',
                '87 days 02:49:00'],
               dtype='timedelta64[ns]', length=87840, freq=None

Finally, find the positional index of the minimum absolute difference and use that to index into the dataframe:

In [38]: idx = np.argmin(nearer_abs_diff)

In [39]: df.iloc[idx]
Out[39]:
val    37346
Name: 2014-04-06 02:14:00, dtype: int64

Upvotes: 1

Related Questions