Steve
Steve

Reputation: 151

Is there a reason I cannot use the filter function on a pandas Dataframe after using the isin function?

Could someone help me work out why I cannot use the filter function on a Dataframe that I used the isin function on? I have imported answers.csv into a pandas dataframe. The following code is to remove all answers where 'deleted' is == 1.

#remove 'deleted answers' rows from the answers df - cleaning purposes

df_answers_del = df_answers[df_answers['deleted'].isin([0])]
print(df_answers_del)

this returns the answers dataframe with all the deleted answers removed from the df.

      answer_id  user_id  deleted   ...     comments
0             1     1015        0   ...          NaN
1             2     1019        0   ...          NaN
2             3     1015        0   ...          NaN
3             5     1015        0   ...          NaN
4             6     1099        0   ...          NaN            
5             7     1019        0   ...          NaN
6             9     1099        0   ...          NaN
7            10     1015        0   ...          NaN        

The idea is to now filter users that have answered all 100 questions by counting the number of occurrences of the user_id in the DF. The code below is cut off at the end, this is a copy of the code. #df_answers_del.groupby('user_id').filter(lambda x: len(x) > 100)

#filter all rows where users have answered all 100 questions
df_answers_del_completed = df_answers_del.groupby('user_id').filter(lambda x: len(x) > 100)
print(df_answers_del_completed)

Python does not return any error, just an empty dataframe.

Empty DataFrame
Columns: []
Index: []

If I remove the first chunk of code with the isin function, it returns as expected, just included some extra rows with deleted answers and also the incorrect number of completed users as the filter still counts the 'user_id' from the deleted answers.

       answer_id  user_id  deleted   ...     comments
0             1     1015        0   ...          NaN
1             2     1019        0   ...          NaN
2             3     1015        0   ...          NaN
3             4     1099        1   ...          NaN
4             5     1015        0   ...          NaN
5             6     1099        0   ...          NaN            
6             7     1019        0   ...          NaN
7             9     1099        0   ...          NaN
8            10     1015        0   ...          NaN  

I am wanting the results to look something like this. Assume that the users in the results have answered all 100 questions excluding deleted answers.

      answer_id  user_id  deleted   ...     comments
0             1     1015        0   ...          NaN
1             2     1019        0   ...          NaN
2             3     1015        0   ...          NaN
3             5     1015        0   ...          NaN
4             7     1019        0   ...          NaN               
5            10     1015        0   ...          NaN   

Thanks in advance

Upvotes: 1

Views: 106

Answers (2)

jezrael
jezrael

Reputation: 863541

Is there a reason I cannot use the filter function on a pandas Dataframe after using the isin function?

It depends what you need - if filter lengths from original data (my first solution) or filter from once filtered data like nixon answer or my second solution.


I believe you need chain 2 boolean mask in boolean indexing - first for filter removed rows and second for get all rows with length 100 :

mask1 = df_answers['deleted'] == 0
mask2 = df_answers.groupby('user_id')['user_id'].transform('size') == 100

df = df_answers[mask1 & mask2]

If performance is important and large data better is avoid filter function, because slow, check timings.

So if want filter again:

df_answers_del = df_answers[df_answers['deleted'] == 0]
mask = df_answers_del.groupby('user_id')['user_id'].transform('size') == 100
df_answers_del_completed  = df_answers_del[mask]

Upvotes: 2

yatu
yatu

Reputation: 88305

If you want to see which have answered all 100 questions, why are you using a > sign when filtering? Do:

df_answers_del_completed = (df_answers_del.groupby('user_id')
                                         .filter(lambda x: len(x) == 100))

Upvotes: 3

Related Questions