Finding minimum value of a column between two entries in another column

Question

Viewed 64 times

0

I have two columns in a data frame containing more than 1000 rows. Column A can take values X,Y,None. Column B contains random numbers from 50 to 100.

Every time there is a non 'None' occurrence in Column A, it is considered as occurrence4. so, previous non None occurrence in Column A will be occurrence3, and the previous to that will be occurrence2 and the previous to that will be occurrence1. I want to find the minimum value of column B between occurrence4 and occurrence3 and check if it is greater than the minimum value of column B between occurrence2 and occurrence1. The results can be stored in a new column in the data frame as "YES" or "NO".

SAMPLE INPUT

ROWNUM  A    B
1      None  68
2      None  83
3        X   51
4      None  66
5      None  90
6        Y   81
7      None  81
8      None  100
9      None  83
10     None  78
11       X   68
12     None  53
13     None  83
14       Y   68
15     None  94
16     None  50
17     None  71
18     None  71
19     None  52
20     None  67
21     None  82
22       X   76
23     None  66
24     None  92

For example, I need to find the minimum value of Column B between ROWNUM 14 and ROWNUM 11 and check if it is GREATER THAN the minimum value of Column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 AND ROWNUM 14 and check if it is GREATER THAN the minimum value between ROWNUM 11 and ROWNNUM 6 and so on.

EDIT:

In the sample data, we start our calculation from row 14, since that is where we have the fourth non none occurrence of column A. The minimum value between row 14 and row 11 is 53. The minimum value between row 6 and 3 is 51. Since 53 > 51, , it means the minimum value of column B between occurrence 4 and occurrence 3, is GREATER THAN minimum value of column B between occurrence 2 and occurrence 1. So, output at row 14 would be "YES" or 1.

Next, at row 22, the minimum value between row 22 and row 14 is 50. The minimum value between row 11 and 6 is 68. Since 50 < 68, it means minimum between occurrence 4 and occurrence 3 is NOT GREATER THAN minimum between occurrence 2 and occurrence 1. So, output at row 22 would be "NO" or 0.

I have the following code.

import numpy as np
import pandas as pd


df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
        lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)

This code works when column A has inputs [0,1]. But I need a code where column A could contain [None, X, Y]. Also, this code produces output as [0,1]. I need output as [YES, NO] instead.

Valdi_Bo · Accepted Answer

I read your sample data as follows:

df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])

Note na_values=['None'], which provides that None in input (a string) is read as NaN.

This way the DataFrame is:

    ROWNUM    A    B
0        1  NaN   68
1        2  NaN   83
2        3    X   51
3        4  NaN   66
4        5  NaN   90
5        6    Y   81
6        7  NaN   81
7        8  NaN  100
8        9  NaN   83
9       10  NaN   78
10      11    X   68
11      12  NaN   53
12      13  NaN   83
13      14    Y   69
14      15  NaN   94
15      16  NaN   50
16      17  NaN   71
17      18  NaN   71
18      19  NaN   52
19      20  NaN   67
20      21  NaN   82
21      22    X   76
22      23  NaN   66
23      24  NaN   92

The code to do your task is:

res = df.index[df.A.notnull()].to_series().rolling(4).apply(
    lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
    .dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df.Result.fillna('', inplace=True)

As you can see, it is in part a slight change of your code, with some additions.

The result is:

    ROWNUM    A    B Result
0        1  NaN   68       
1        2  NaN   83       
2        3    X   51       
3        4  NaN   66       
4        5  NaN   90       
5        6    Y   81       
6        7  NaN   81       
7        8  NaN  100       
8        9  NaN   83       
9       10  NaN   78       
10      11    X   68       
11      12  NaN   53       
12      13  NaN   83       
13      14    Y   69    YES
14      15  NaN   94       
15      16  NaN   50       
16      17  NaN   71       
17      18  NaN   71       
18      19  NaN   52       
19      20  NaN   67       
20      21  NaN   82       
21      22    X   76     NO
22      23  NaN   66       
23      24  NaN   92

The advantage of my solution over the other is that:

the content is either YES or NO, just as you want,
this content shows up only for non-null values in A column, "ignoring" first 3, which don't have enough "predecessors".

Finding minimum value of a column between two entries in another column

Answers (2)

Related Questions