The Great

Reputation: 7733

How to elegantly drop the identified records using pandas?

I have a pandas dataframe as shown below

import numpy as np
import pandas as pd

df = pd.DataFrame({'subject_id': [101,102,103,201,202],
                   'test_id': [21,21,np.nan,24,25],
                   'test_name': ['A','B',np.nan,'D','E'],
                   'invalid_condition': [0,0,0,0,1]})

I would like to identify the problem records based on the conditions below (a subject with no test info at all, a test_id mapped to more than one test_name, and rows where invalid_condition is 1) and then drop them

While I am able to identify them using the code below, I am not sure how to drop them

# % of rows for subjects that have no test information at all
subject_with_no_test_info = len(df.groupby('subject_id').filter(lambda x: x['test_id'].count() == 0))*100/len(df)
# % of rows where a single test_id maps to more than one test_name
test_id_diff_names = len(df.groupby('test_id').filter(lambda x: x['test_name'].nunique() > 1))*100/len(df)
# % of rows flagged by invalid_condition
invalid_condition = len(df[df['invalid_condition']==1])*100/len(df)
data_inconsistencies_df = pd.DataFrame([[subject_with_no_test_info, test_id_diff_names, invalid_condition]], columns=['subject_with_no_test_info', 'test_id_diff_names', 'invalid_condition'])

This gives me an output like below

   subject_with_no_test_info  test_id_diff_names  invalid_condition
0                       20.0                40.0               20.0

but now I want to drop the records that contribute to the 20%, 40% and 20% under each column of data_inconsistencies_df.

Any elegant and efficient way to drop these records from the dataframe?

Upvotes: 1

Views: 58

Answers (1)

Anurag Dabas

Reputation: 24324

Try:

d1 = df.groupby('subject_id').filter(lambda x: x['test_id'].count() == 0)
d2 = df.groupby('test_id').filter(lambda x: x['test_name'].nunique() > 1)
d3 = df[df['invalid_condition']==1]
#your three conditions

data_inconsistencies_df = pd.DataFrame([[len(d1)*100/len(df), len(d2)*100/len(df), len(d3)*100/len(df)]], columns=['subject_with_no_test_info', 'test_id_diff_names', 'invalid_condition'])
#dataframe created to show the percentages

d2 = d2.drop_duplicates('test_id', keep='last')
#^your sub-condition: keep only the last duplicate of each test_id in the to-drop set,
# so the first occurrence (subject 101) stays in df

to_drop = pd.concat([d1, d2, d3]).index
#concatenating the 3 dataframes to grab the index of the rows that will be dropped
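
Note that with more than two rows sharing a test_id, drop_duplicates(keep='last') leaves only one of them in the to-drop set, so only that one row would be removed. A small variant that always keeps just the first occurrence and marks every later duplicate for dropping (equivalent on this data) could be:

d2 = d2[d2.duplicated('test_id', keep='first')]
#keep row 0 (subject 101); every later row with the same test_id stays in the to-drop set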

Finally:

df = df.drop(to_drop)
#dropping those rows by index

Output of df:

   subject_id  test_id test_name  invalid_condition
0         101     21.0         A                  0
3         201     24.0         D                  0

Output of data_inconsistencies_df:

   subject_with_no_test_info  test_id_diff_names  invalid_condition
0                       20.0                40.0               20.0
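
For comparison, here is a sketch of an alternative that builds one boolean mask per condition and drops everything in a single pass (the mask names are just illustrative, and df is the frame from the question):

# same three conditions, expressed as row-wise boolean masks
no_test_info  = df.groupby('subject_id')['test_id'].transform('count').eq(0)
dup_test_name = (df.groupby('test_id')['test_name'].transform('nunique').gt(1)
                 & df.duplicated('test_id', keep='first'))  # keep the first occurrence of each test_id
invalid       = df['invalid_condition'].eq(1)

df_clean = df[~(no_test_info | dup_test_name | invalid)]

On the sample data this leaves the same two rows (subjects 101 and 201) without concatenating intermediate frames.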

Upvotes: 2
