Reputation: 7733
I have a pandas dataframe like as shown below
df = pd.DataFrame({'subject_id': [101,102,103,201,202],
'test_id':[21,21,np.nan,24,25],
'test_name':['A','B',np.nan,'D','E'],
'invalid_condition':[0,0,0,0,1]})
I would like to identify the issues in the data based on below conditions and drop them
While I am able to identify them using the code below, I am not sure how to drop them
subject_with_no_test_info = len(df.groupby('subject_id').filter(lambda x: x['test_id'].count() == 0))*100/len(df)
test_id_diff_names = len(df.groupby('test_id').filter(lambda x: x['test_name'].nunique() > 1))*100/len(df)
invalid_condition = len(df[df['invalid_condition']==1])*100/len(df)
data_inconsistencies_df = pd.DataFrame([[subject_with_no_test_info,test_id_diff_names,invalid_condition]],columns = ['subject_with_no_test_info','test_id_diff_names','invalid_condition'])
This gives me an output like below
but now I want to drop those records which contribute to the 20%, 40% and 20%
under each column in the data_incosistencies_df?
Any elegant and efficient way to drop these records from the dataframe?
Upvotes: 1
Views: 58
Reputation: 24324
Try:
d1=df.groupby('subject_id').filter(lambda x: x['test_id'].count() == 0)
d2=df.groupby('test_id').filter(lambda x: x['test_name'].nunique() > 1)
d3=df[df['invalid_condition']==1]
#your conditions
data_inconsistencies_df = pd.DataFrame([[(len(d1)*100/len(df)),(len(d2)*100/len(df)),(len(d3)*100/len(df))]],columns = ['subject_with_no_test_info','test_id_diff_names','invalid_condition'])
#created dataframe to show percentages
d2=d2.drop_duplicates('test_id',keep='last')
#^your sub condition
#(We don't drop 101, because we keep = first of duplicate test ids)
to_drop=pd.concat([d1,d2,d3]).index
#concatinating 3 dataframes to grab the index of the rows which are going to drop
Finally:
df=df.drop(to_drop)
#dropping those indexes
Output of df
:
subject_id test_id test_name invalid_condition
0 101 21.0 A 0
3 201 24.0 D 0
Output of data_inconsistencies_df
:
subject_with_no_test_info test_id_diff_names invalid_condition
0 20.0 40.0 20.0
Upvotes: 2