I have a dataset similar to below, I want to compare all rows for each inventory item based on the conditions that if "value" column is not null, then flag it "yes" Else if "value" column for all row (per each inventory) is null, then evaluate the "item" column, if either "item" column is not null, then that row should be flagged "yes". Or 1 record under "value" is null where 2 records under "value" is not null, then those 2 rows need to be evaluated based on the item In the case that if both "value" and "item" are both null(per each inventory), evaluate the "year" column and flag the most updated year "yes". Or "value" is null but 1 record under "item" is null where 2 records under "item" is not null, then evaluate the "year" column and flag the most updated year "yes" df['flag'].fillna('no',inplace=True) I can set up this if then else logic but I don't know how to compare current row with 2 previous rows. The groupby with transform and custom function are great advices and I wonder how to capture all these scenarios or if there is a better way. Below is how the simplified df would look like. In my real dataset, as described above, there are also cases that for the same inventory, "value" column are all null, but there are 2 rows under "item" are not null, in that case, "year" column needs to be evaluated for those 2 rows. df1 = { 'inventory':['inv1','inv1','inv1','inv2','inv2','inv2','inv3','inv3','inv3'], 'value':['xyz','','','','','','','',''], 'item':['','304','304','','205','','','',''], 'year':[2020,2020,2020,2020,2020,2020,2019,2018,2020]} df1=pd.DataFrame(df1) desired output would be like below - adding a flag column to flag 'yes'/'no' based on the above multiple condtions.

user032020

Reputation: 460

How to compare current row with previous 2 rows based on certain conditions in Python

I have a dataset similar to below, I want to compare all rows for each inventory item based on the conditions that

if "value" column is not null, then flag it "yes"
Else if "value" column for all row (per each inventory) is null, then evaluate the "item" column, if either "item" column is not null, then that row should be flagged "yes". Or 1 record under "value" is null where 2 records under "value" is not null, then those 2 rows need to be evaluated based on the item
In the case that if both "value" and "item" are both null(per each inventory), evaluate the "year" column and flag the most updated year "yes". Or "value" is null but 1 record under "item" is null where 2 records under "item" is not null, then evaluate the "year" column and flag the most updated year "yes"
df['flag'].fillna('no',inplace=True)

I can set up this if then else logic but I don't know how to compare current row with 2 previous rows. The groupby with transform and custom function are great advices and I wonder how to capture all these scenarios or if there is a better way.

Below is how the simplified df would look like. In my real dataset, as described above, there are also cases that for the same inventory, "value" column are all null, but there are 2 rows under "item" are not null, in that case, "year" column needs to be evaluated for those 2 rows.

df1 = { 'inventory':['inv1','inv1','inv1','inv2','inv2','inv2','inv3','inv3','inv3'],
  'value':['xyz','','','','','','','',''],
  'item':['','304','304','','205','','','',''],
   'year':[2020,2020,2020,2020,2020,2020,2019,2018,2020]}

df1=pd.DataFrame(df1)

desired output would be like below - adding a flag column to flag 'yes'/'no' based on the above multiple condtions.

Upvotes: 0

Answers (3)

user032020

Reputation: 460

By borrowing both @Peter Leimbigler and @richardec ideas, I modified the logic as below and it works for my real work situation (which is much more complex than what I presented in the question) because there are different numbers of rows for each inventory item and different combination on "value","item" and "year".

The only down side for my solution is I am not able to use groupby, so I had to create a loop to create individual df for each inventory item and append them in a full dataframe. I hope there would be a better way than that.

df1 = df1.replace('', np.nan)

def make_flag_col(df,n):
    comment_len = len(df.loc[df['value'].notna(),:])
    jus_len = len(df.loc[df['item'].notna(),:])
    maxi = df['year'].max()
    n = df.shape[0]
    if comment_len == 1:
        df.loc[df['value'].notna(),'flag'] = 'keep'
    elif comment_len >=2 or comment_len == 0:
        if jus_len == 1:
            df.loc[df['item'].notna(),'flag'] = 'keep'
        else:
            df.loc[df['year']==maxi,'flag'] = 'keep'
    else:
        print('out of pattern, check')
        return df

l = df1['inventory'].unique().tolist()

full_df = []
for itm in l:
    df_test = df1[df1['inventory']== itm].copy()
    df_test = df_test.sort_values(by=['year']).reset_index()
    df_row_number = df_test.shape[0]

    make_flag_col(df_test,df_row_number - 1)

    for i in range(len(df_test)):
        row = df_test.iloc[i]
        whole = list(row)
        full_df.append(whole)

full_df = pd.DataFrame(full_df,columns=list(df_test.columns))
full_df['flag'].fillna('remove',inplace=True)

Upvotes: 0

user17242583

Reputation:

This is actually not very complex. You mostly need a few groupby + transform combinations. Here's a vectorized (read: very fast) solution:

df = df.replace('', np.nan)

cond = (
    df['value'].notna() |
        (df['value'].isna().groupby(df['inventory']).transform('all') 
            & df['item'].notna()) |
        (df[['value', 'item']].isna().groupby(df['inventory']).transform('all').all(axis=1) &
            (df['year'] == df.groupby('inventory')['year'].transform('max')))
)

df['flag'] = cond.map({True: 'yes', False: 'no'})

Output:

>>> df
  inventory value item  year flag
0      inv1   xyz  NaN  2020  yes
1      inv1   NaN  304  2020   no
2      inv1   NaN  304  2020   no
3      inv2   NaN  NaN  2020   no
4      inv2   NaN  205  2020  yes
5      inv2   NaN  NaN  2020   no
6      inv3   NaN  NaN  2019   no
7      inv3   NaN  NaN  2018   no
8      inv3   NaN  NaN  2020  yes

Upvotes: 1

Peter Leimbigler

Reputation: 11105

You can write your rules into a custom function and apply it to each group:

# Replace blank spaces with NaN
df1 = df1.replace('', np.nan)

def make_flag_col(subdf):
    if subdf['value'].any():
        return subdf['value'].notna()
    elif subdf['item'].any():
        return subdf['item'].notna()
    else:
        return subdf['year'] == subdf['year'].max()

df1['flag'] = (df1.groupby('inventory', group_keys=False)
                  .apply(make_flag_col)
                  .replace({True: 'yes', False: 'no'}))

print(df1)

  inventory value item  year flag
0      inv1   xyz  NaN  2020  yes
1      inv1   NaN  304  2020   no
2      inv1   NaN  304  2020   no
3      inv2   NaN  NaN  2020   no
4      inv2   NaN  205  2020  yes
5      inv2   NaN  NaN  2020   no
6      inv3   NaN  NaN  2019   no
7      inv3   NaN  NaN  2018   no
8      inv3   NaN  NaN  2020  yes

Upvotes: 1

How to compare current row with previous 2 rows based on certain conditions in Python

Answers (3)

Related Questions