Seelfun

Reputation: 53

Get the Minimum and Maximum value within specific date range in DataFrame

I have a DataFrame with the columns 'From' (datetime) and 'To' (datetime). The date ranges of different rows overlap.

Here is a simplified version of the criteria DataFrame (the date ranges vary and overlap with each other):

import pandas as pd

df1 = pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31', freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04', freq='2D')})

    From    To
0   2020-01-01  2020-01-05
1   2020-01-03  2020-01-07
2   2020-01-05  2020-01-09
3   2020-01-07  2020-01-11
4   2020-01-09  2020-01-13
5   2020-01-11  2020-01-15
6   2020-01-13  2020-01-17
7   2020-01-15  2020-01-19
8   2020-01-17  2020-01-21
9   2020-01-19  2020-01-23
10  2020-01-21  2020-01-25
11  2020-01-23  2020-01-27
12  2020-01-25  2020-01-29
13  2020-01-27  2020-01-31
14  2020-01-29  2020-02-02
15  2020-01-31  2020-02-04

I also have a DataFrame that keeps the daily high and low values, like this:

import random

random.seed(0)
df2 = pd.DataFrame({'Date': pd.date_range(start='2020-01-01', end='2020-01-31'), 'High': [random.randint(7,15)+5 for i in range(31)], 'Low': [random.randint(0,7)-1 for i in range(31)]})

    Date    High    Low
0   2020-01-01  18  6
1   2020-01-02  18  6
2   2020-01-03  12  3
3   2020-01-04  16  -1
4   2020-01-05  20  -1
5   2020-01-06  19  0
6   2020-01-07  18  5
7   2020-01-08  16  -1
8   2020-01-09  19  6
9   2020-01-10  17  4
10  2020-01-11  15  2
11  2020-01-12  20  4
12  2020-01-13  14  0
13  2020-01-14  16  2
14  2020-01-15  14  2
15  2020-01-16  13  2
16  2020-01-17  16  1
17  2020-01-18  20  6
18  2020-01-19  14  0
19  2020-01-20  16  0
20  2020-01-21  13  4
21  2020-01-22  13  6
22  2020-01-23  17  0
23  2020-01-24  19  3
24  2020-01-25  20  3
25  2020-01-26  13  0
26  2020-01-27  17  4
27  2020-01-28  18  2
28  2020-01-29  17  3
29  2020-01-30  15  6
30  2020-01-31  20  0

I want to get the maximum High and the minimum Low within each From/To range in df1. Here is the expected result:

result = pd.DataFrame({'From': pd.date_range(start='2020-01-01', end='2020-01-31',freq='2D'), 'To': pd.date_range(start='2020-01-05', end='2020-02-04',freq='2D'), 'High':[20,20,20,19,20,20,16,20,20,17,20,20,20,20,20,20], 'Low':[-1,-1,-1,-1,0,0,0,0,0,0,0,0,0,0,0,0]})

    From    To  High    Low
0   2020-01-01  2020-01-05  20  -1
1   2020-01-03  2020-01-07  20  -1
2   2020-01-05  2020-01-09  20  -1
3   2020-01-07  2020-01-11  19  -1
4   2020-01-09  2020-01-13  20  0
5   2020-01-11  2020-01-15  20  0
6   2020-01-13  2020-01-17  16  0
7   2020-01-15  2020-01-19  20  0
8   2020-01-17  2020-01-21  20  0
9   2020-01-19  2020-01-23  17  0
10  2020-01-21  2020-01-25  20  0
11  2020-01-23  2020-01-27  20  0
12  2020-01-25  2020-01-29  20  0
13  2020-01-27  2020-01-31  20  0
14  2020-01-29  2020-02-02  20  0
15  2020-01-31  2020-02-04  20  0

I have tried the resampling method, but it does not seem to support custom date ranges. I'm looking for a reasonably efficient and elegant way of doing this. Thank you very much.

Upvotes: 3

Views: 4884

Answers (5)

Ben.T

Reputation: 29635

Given the size of the data, I think you should consider another approach: vectorize the date comparison between df1 and df2, working over df1 in chunks. It takes a lot more lines than the other solutions, but it will be much faster for large DataFrames.

import numpy as np

# this is a parameter you can play with,
# but if your df1 is in memory, this value should work
nb_split = int((len(df1)*len(df2))//4e6)+1

# work with arrays of float
arr1 = df1[['From','To']].astype('int64').to_numpy().astype(float)
arr2 = df2.astype('int64').to_numpy().astype(float)
# create result array
arr_out = np.zeros((len(arr1), 2), dtype=float)
i = 0 #index position
for arr1_sp in np.array_split(arr1, nb_split, axis=0):
    # get length of the chunk
    lft = len(arr1_sp)
    # get the min datetime in From and max in To
    min_from = arr1_sp[:, 0].min()
    max_to = arr1_sp[:, 1].max()

    # select the rows of arr2 that are within the min and max date of the split
    arr2_sp = arr2[(arr2[:,0]>=min_from)&(arr2[:,0]<=max_to), :]

    # create a bool array with True where the date in arr2_sp is on or after From
    # and on or before To; each row is the result for one row of arr1_sp
    m = np.less_equal.outer(arr1_sp[:,0], arr2_sp[:, 0])\
        &np.greater_equal.outer(arr1_sp[:,1], arr2_sp[:, 0])

    # use this mask to get the values high and low within the range row-wise
    # and replace where the mask was False by np.nan
    arr_high = arr2_sp[:,1]*m
    arr_high[~m] = np.nan
    arr_low = arr2_sp[:,2]*m
    arr_low[~m] = np.nan

    # put the result in the result array
    arr_out[i:i+lft, 0] = np.nanmax(arr_high, axis=1)
    arr_out[i:i+lft, 1] = np.nanmin(arr_low, axis=1)
    i += lft #update first idx position for next loop

# create the columns in df1
df1['High'] = arr_out[:, 0]
df1['Low'] = arr_out[:, 1]

I tried this with a df1 of 10000 rows and a df2 of 5000 rows: this method takes about 102 ms, while the apply method with getHighLow2 takes about 8 s, so it is about 80 times faster. The results were the same.
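To make the masking step above easier to follow, here is a minimal sketch on toy data. The arrays and their names are hypothetical, dates are encoded as plain day numbers, and `np.where` is used in place of the multiply-then-overwrite step, which should give the same masked array.

```python
import numpy as np

# Hypothetical toy data: dates encoded as plain day numbers for readability
starts = np.array([1.0, 3.0])                    # 'From' of two ranges
ends = np.array([3.0, 5.0])                      # 'To' of the same two ranges
days = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # daily dates
high = np.array([18.0, 12.0, 15.0, 16.0, 19.0])  # daily highs

# mask[i, j] is True when days[j] lies inside range i (both ends inclusive)
mask = np.less_equal.outer(starts, days) & np.greater_equal.outer(ends, days)

# keep the values inside each range, NaN elsewhere, then reduce row-wise
high_per_range = np.nanmax(np.where(mask, high, np.nan), axis=1)

print(mask)
print(high_per_range)
```

The mask has one row per range and one column per daily date, so its size is the product of the two lengths. That is why the answer splits df1 into chunks before building it.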

Upvotes: 1

LevB

Reputation: 953

You can create a simple function that gets the min and max within a given date range, then use the apply function to add the columns.

def MaxMin(row):
    dfRange = df2[(df2['Date']>=row['From'])&(df2['Date']<=row['To'])] # df2 rows within a given date range
    row['High'] = dfRange['High'].max()
    row['Low'] = dfRange['Low'].min()
    return row

df1 = df1.apply(MaxMin, axis =1)

Upvotes: 1

Quang Hoang

Reputation: 150735

I would do a cross merge and query, then groupby:

(df1.assign(dummy=1)
   .merge(df2.assign(dummy=1), on='dummy')   # this is cross merge
   .drop('dummy', axis=1)                    # remove the `dummy` column
   .query('From<=Date<=To')                  # only choose valid data
   .groupby(['From','To'])                   # groupby `From` and `To`
   .agg({'High':'max','Low':'min'})          # aggregation
   .reset_index()                            
)

Output:

         From         To  High  Low
0  2020-01-01 2020-01-05    20   -1
1  2020-01-03 2020-01-07    20   -1
2  2020-01-05 2020-01-09    20   -1
3  2020-01-07 2020-01-11    19   -1
4  2020-01-09 2020-01-13    20    0
5  2020-01-11 2020-01-15    20    0
6  2020-01-13 2020-01-17    16    0
7  2020-01-15 2020-01-19    20    0
8  2020-01-17 2020-01-21    20    0
9  2020-01-19 2020-01-23    17    0
10 2020-01-21 2020-01-25    20    0
11 2020-01-23 2020-01-27    20    0
12 2020-01-25 2020-01-29    20    0
13 2020-01-27 2020-01-31    20    0
14 2020-01-29 2020-02-02    20    0
15 2020-01-31 2020-02-04    20    0
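As a side note, pandas 1.2 and later accept `how="cross"` in `merge`, which avoids the dummy column. The following is a minimal sketch on small made-up frames (`ranges` and `daily` are hypothetical stand-ins for df1 and df2), not a replacement for the code above.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2
ranges = pd.DataFrame({
    "From": pd.to_datetime(["2020-01-01", "2020-01-03"]),
    "To": pd.to_datetime(["2020-01-03", "2020-01-05"]),
})
daily = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", "2020-01-05"),
    "High": [18, 12, 15, 16, 19],
    "Low": [6, 3, -1, 0, 5],
})

out = (
    ranges.merge(daily, how="cross")          # every range paired with every day
    .query("From <= Date <= To")              # keep days inside the range
    .groupby(["From", "To"], as_index=False)  # one group per range
    .agg(High=("High", "max"), Low=("Low", "min"))
)
print(out)
```

Either way, the intermediate frame holds len(df1) * len(df2) rows, so this approach is best suited to inputs where that product fits comfortably in memory. Ranges that contain no dates from the daily frame drop out of the result.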

Upvotes: 1

Valdi_Bo

Reputation: 30971

Define the following function:

def getHighLow(row):
    wrk = df2[df2.Date.between(row.From, row.To)]
    return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])

Then run:

df1.join(df1.apply(getHighLow, axis=1))

According to the DRY rule, it is better to find wrk (the set of rows between the given dates) once and then extract the maximal High and the minimal Low from wrk.

Another advantage over the other solution: my code runs about 30% faster (at least on my computer, measured with %timeit).

Edit

An even quicker solution is possible when the search in df2 is performed by index instead of on a regular column.

As a preparatory step run:

df2a = df2.set_index('Date')

Then define another variant of the getHighLow function:

def getHighLow2(row):
    wrk = df2a.loc[row.From : row.To]
    return pd.Series([wrk.High.max(), wrk.Low.min()], index=['High', 'Low'])

To get the result, run:

df1.join(df1.apply(getHighLow2, axis=1))

For your data, the execution time is about half that of the other solution (not including the time to create df2a, but df2 could be created in this form, with Date as the index, from the start).
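The index-based variant relies on label slicing with `.loc` including both endpoints, unlike positional slicing. Here is a small sketch with a made-up frame (`daily` is a hypothetical stand-in for df2a) that shows this behaviour on a sorted DatetimeIndex:

```python
import pandas as pd

# Hypothetical stand-in for df2a: daily values indexed by a sorted Date index
daily = pd.DataFrame(
    {"High": [18, 12, 15, 16, 19], "Low": [6, 3, -1, 0, 5]},
    index=pd.date_range("2020-01-01", "2020-01-05", name="Date"),
)

# Label slicing keeps both endpoints: 2020-01-02, 2020-01-03 and 2020-01-04
window = daily.loc[pd.Timestamp("2020-01-02"):pd.Timestamp("2020-01-04")]

# A range that extends past the available dates is truncated to what exists
tail = daily.loc[pd.Timestamp("2020-01-04"):pd.Timestamp("2020-01-09")]

print(window)
print(tail)
```

This matters for ranges such as 2020-01-31 to 2020-02-04 in the question, where only the first day exists in the daily data.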

Upvotes: 0

nocibambi

Reputation: 2421

Here is a function which does this:

  • Checks which dates are in the from/to interval
  • Gets the maximum of the High column and the minimum of the Low column

def get_high_low(d1):
    # rows of df2 whose Date falls inside the From/To interval of this row
    in_range = df2["Date"].isin(pd.date_range(d1["From"], d1["To"]))

    high = df2.loc[in_range, "High"].max()
    low = df2.loc[in_range, "Low"].min()

    return pd.Series([high, low], index=["High", "Low"])

Then we can just apply this function and concatenate the result with the dates.

pd.concat([df1, df1.apply(get_high_low, axis=1)], axis=1)

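The result (computed on a df2 with different random values than the question's, so it does not match the expected output):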

    From    To  High    Low
0   2020-01-01  2020-01-05  19  4
1   2020-01-03  2020-01-07  17  5
2   2020-01-05  2020-01-09  19  5
3   2020-01-07  2020-01-11  19  2
4   2020-01-09  2020-01-13  17  4
5   2020-01-11  2020-01-15  19  4
6   2020-01-13  2020-01-17  19  5
7   2020-01-15  2020-01-19  18  5
8   2020-01-17  2020-01-21  18  0
9   2020-01-19  2020-01-23  19  3
10  2020-01-21  2020-01-25  19  5
11  2020-01-23  2020-01-27  19  5
12  2020-01-25  2020-01-29  17  5
13  2020-01-27  2020-01-31  17  3
14  2020-01-29  2020-02-02  17  1
15  2020-01-31  2020-02-04  13  -1

Upvotes: 1
