Reputation: 460
I have a dataset similar to below, I want to compare all rows for each inventory item based on the conditions that
if "value" column is not null, then flag it "yes"
Else if "value" column for all row (per each inventory) is null, then evaluate the "item" column, if either "item" column is not null, then that row should be flagged "yes". Or 1 record under "value" is null where 2 records under "value" is not null, then those 2 rows need to be evaluated based on the item
In the case that if both "value" and "item" are both null(per each inventory), evaluate the "year" column and flag the most updated year "yes". Or "value" is null but 1 record under "item" is null where 2 records under "item" is not null, then evaluate the "year" column and flag the most updated year "yes"
df['flag'].fillna('no',inplace=True)
I can set up this if then else logic but I don't know how to compare current row with 2 previous rows. The groupby with transform and custom function are great advices and I wonder how to capture all these scenarios or if there is a better way.
Below is how the simplified df would look like. In my real dataset, as described above, there are also cases that for the same inventory, "value" column are all null, but there are 2 rows under "item" are not null, in that case, "year" column needs to be evaluated for those 2 rows.
df1 = { 'inventory':['inv1','inv1','inv1','inv2','inv2','inv2','inv3','inv3','inv3'],
'value':['xyz','','','','','','','',''],
'item':['','304','304','','205','','','',''],
'year':[2020,2020,2020,2020,2020,2020,2019,2018,2020]}
df1=pd.DataFrame(df1)
desired output would be like below - adding a flag column to flag 'yes'/'no' based on the above multiple condtions.
Upvotes: 0
Views: 484
Reputation: 460
By borrowing both @Peter Leimbigler and @richardec ideas, I modified the logic as below and it works for my real work situation (which is much more complex than what I presented in the question) because there are different numbers of rows for each inventory item and different combination on "value","item" and "year".
The only down side for my solution is I am not able to use groupby, so I had to create a loop to create individual df for each inventory item and append them in a full dataframe. I hope there would be a better way than that.
df1 = df1.replace('', np.nan)
def make_flag_col(df,n):
comment_len = len(df.loc[df['value'].notna(),:])
jus_len = len(df.loc[df['item'].notna(),:])
maxi = df['year'].max()
n = df.shape[0]
if comment_len == 1:
df.loc[df['value'].notna(),'flag'] = 'keep'
elif comment_len >=2 or comment_len == 0:
if jus_len == 1:
df.loc[df['item'].notna(),'flag'] = 'keep'
else:
df.loc[df['year']==maxi,'flag'] = 'keep'
else:
print('out of pattern, check')
return df
l = df1['inventory'].unique().tolist()
full_df = []
for itm in l:
df_test = df1[df1['inventory']== itm].copy()
df_test = df_test.sort_values(by=['year']).reset_index()
df_row_number = df_test.shape[0]
make_flag_col(df_test,df_row_number - 1)
for i in range(len(df_test)):
row = df_test.iloc[i]
whole = list(row)
full_df.append(whole)
full_df = pd.DataFrame(full_df,columns=list(df_test.columns))
full_df['flag'].fillna('remove',inplace=True)
Upvotes: 0
Reputation:
This is actually not very complex. You mostly need a few groupby
+ transform
combinations. Here's a vectorized (read: very fast) solution:
df = df.replace('', np.nan)
cond = (
df['value'].notna() |
(df['value'].isna().groupby(df['inventory']).transform('all')
& df['item'].notna()) |
(df[['value', 'item']].isna().groupby(df['inventory']).transform('all').all(axis=1) &
(df['year'] == df.groupby('inventory')['year'].transform('max')))
)
df['flag'] = cond.map({True: 'yes', False: 'no'})
Output:
>>> df
inventory value item year flag
0 inv1 xyz NaN 2020 yes
1 inv1 NaN 304 2020 no
2 inv1 NaN 304 2020 no
3 inv2 NaN NaN 2020 no
4 inv2 NaN 205 2020 yes
5 inv2 NaN NaN 2020 no
6 inv3 NaN NaN 2019 no
7 inv3 NaN NaN 2018 no
8 inv3 NaN NaN 2020 yes
Upvotes: 1
Reputation: 11105
You can write your rules into a custom function and apply it to each group:
# Replace blank spaces with NaN
df1 = df1.replace('', np.nan)
def make_flag_col(subdf):
if subdf['value'].any():
return subdf['value'].notna()
elif subdf['item'].any():
return subdf['item'].notna()
else:
return subdf['year'] == subdf['year'].max()
df1['flag'] = (df1.groupby('inventory', group_keys=False)
.apply(make_flag_col)
.replace({True: 'yes', False: 'no'}))
print(df1)
inventory value item year flag
0 inv1 xyz NaN 2020 yes
1 inv1 NaN 304 2020 no
2 inv1 NaN 304 2020 no
3 inv2 NaN NaN 2020 no
4 inv2 NaN 205 2020 yes
5 inv2 NaN NaN 2020 no
6 inv3 NaN NaN 2019 no
7 inv3 NaN NaN 2018 no
8 inv3 NaN NaN 2020 yes
Upvotes: 1