Reputation: 65
Viewed 64 times
0
I have two columns in a data frame containing more than 1000 rows. Column A can take values X,Y,None. Column B contains random numbers from 50 to 100.
Every time there is a non 'None' occurrence in Column A, it is considered as occurrence4. so, previous non None occurrence in Column A will be occurrence3, and the previous to that will be occurrence2 and the previous to that will be occurrence1. I want to find the minimum value of column B between occurrence4 and occurrence3 and check if it is greater than the minimum value of column B between occurrence2 and occurrence1. The results can be stored in a new column in the data frame as "YES" or "NO".
SAMPLE INPUT
ROWNUM A B
1 None 68
2 None 83
3 X 51
4 None 66
5 None 90
6 Y 81
7 None 81
8 None 100
9 None 83
10 None 78
11 X 68
12 None 53
13 None 83
14 Y 68
15 None 94
16 None 50
17 None 71
18 None 71
19 None 52
20 None 67
21 None 82
22 X 76
23 None 66
24 None 92
For example, I need to find the minimum value of Column B between ROWNUM 14 and ROWNUM 11 and check if it is GREATER THAN the minimum value of Column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 AND ROWNUM 14 and check if it is GREATER THAN the minimum value between ROWNUM 11 and ROWNNUM 6 and so on.
EDIT:
In the sample data, we start our calculation from row 14, since that is where we have the fourth non none occurrence of column A. The minimum value between row 14 and row 11 is 53. The minimum value between row 6 and 3 is 51. Since 53 > 51, , it means the minimum value of column B between occurrence 4 and occurrence 3, is GREATER THAN minimum value of column B between occurrence 2 and occurrence 1. So, output at row 14 would be "YES" or 1.
Next, at row 22, the minimum value between row 22 and row 14 is 50. The minimum value between row 11 and 6 is 68. Since 50 < 68, it means minimum between occurrence 4 and occurrence 3 is NOT GREATER THAN minimum between occurrence 2 and occurrence 1. So, output at row 22 would be "NO" or 0.
I have the following code.
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)
This code works when column A has inputs [0,1]. But I need a code where column A could contain [None, X, Y]. Also, this code produces output as [0,1]. I need output as [YES, NO] instead.
Upvotes: 1
Views: 479
Reputation: 31011
I read your sample data as follows:
df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])
Note na_values=['None']
, which provides that None in input (a string)
is read as NaN.
This way the DataFrame is:
ROWNUM A B
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76
22 23 NaN 66
23 24 NaN 92
The code to do your task is:
res = df.index[df.A.notnull()].to_series().rolling(4).apply(
lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
.dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df.Result.fillna('', inplace=True)
As you can see, it is in part a slight change of your code, with some additions.
The result is:
ROWNUM A B Result
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69 YES
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76 NO
22 23 NaN 66
23 24 NaN 92
The advantage of my solution over the other is that:
Upvotes: 1
Reputation: 150825
Here's my approach:
def is_incr(x):
return x[:2].min() > x[2:].min()
# replace with s = df['A'] == 'None' if needed
s = df['A'].isna()
df['new_col'] = df.loc[s, 'B'].rolling(4).apply(is_incr)
Output:
ROWNUM A B new_col
0 1 NaN 68 NaN
1 2 NaN 83 NaN
2 3 X 51 NaN
3 4 NaN 66 NaN
4 5 NaN 90 1.0
5 6 Y 81 NaN
6 7 NaN 81 0.0
7 8 NaN 100 0.0
8 9 NaN 83 0.0
9 10 NaN 78 1.0
10 11 X 68 NaN
11 12 NaN 53 1.0
12 13 NaN 83 1.0
13 14 Y 68 NaN
14 15 NaN 94 0.0
15 16 NaN 50 1.0
16 17 NaN 71 1.0
17 18 NaN 71 0.0
18 19 NaN 52 0.0
19 20 NaN 67 1.0
20 21 NaN 82 0.0
21 22 X 76 NaN
22 23 NaN 66 0.0
23 24 NaN 92 1.0
Upvotes: 1