Reputation: 151
Could someone help me work out why I cannot use the filter function on a Dataframe that I used the isin function on? I have imported answers.csv into a pandas dataframe. The following code is to remove all answers where 'deleted' is == 1.
#remove 'deleted answers' rows from the answers df - cleaning purposes
df_answers_del = df_answers[df_answers['deleted'].isin([0])]
print(df_answers_del)
this returns the answers dataframe with all the deleted answers removed from the df.
answer_id user_id deleted ... comments
0 1 1015 0 ... NaN
1 2 1019 0 ... NaN
2 3 1015 0 ... NaN
3 5 1015 0 ... NaN
4 6 1099 0 ... NaN
5 7 1019 0 ... NaN
6 9 1099 0 ... NaN
7 10 1015 0 ... NaN
The idea is to now filter users that have answered all 100 questions by counting the number of occurrences of the user_id in the DF. The code below is cut off at the end, this is a copy of the code. #df_answers_del.groupby('user_id').filter(lambda x: len(x) > 100)
#filter all rows where users have answered all 100 questions
df_answers_del_completed = df_answers_del.groupby('user_id').filter(lambda x: len(x) > 100)
print(df_answers_del_completed)
Python does not return any error, just an empty dataframe.
Empty DataFrame
Columns: []
Index: []
If I remove the first chunk of code with the isin function, it returns as expected, just included some extra rows with deleted answers and also the incorrect number of completed users as the filter still counts the 'user_id' from the deleted answers.
answer_id user_id deleted ... comments
0 1 1015 0 ... NaN
1 2 1019 0 ... NaN
2 3 1015 0 ... NaN
3 4 1099 1 ... NaN
4 5 1015 0 ... NaN
5 6 1099 0 ... NaN
6 7 1019 0 ... NaN
7 9 1099 0 ... NaN
8 10 1015 0 ... NaN
I am wanting the results to look something like this. Assume that the users in the results have answered all 100 questions excluding deleted answers.
answer_id user_id deleted ... comments
0 1 1015 0 ... NaN
1 2 1019 0 ... NaN
2 3 1015 0 ... NaN
3 5 1015 0 ... NaN
4 7 1019 0 ... NaN
5 10 1015 0 ... NaN
Thanks in advance
Upvotes: 1
Views: 106
Reputation: 863541
Is there a reason I cannot use the filter function on a pandas Dataframe after using the isin function?
It depends what you need - if filter lengths from original data (my first solution) or filter from once filtered data like nixon answer or my second solution.
I believe you need chain 2 boolean mask in boolean indexing
- first for filter removed rows and second for get all rows with length 100
:
mask1 = df_answers['deleted'] == 0
mask2 = df_answers.groupby('user_id')['user_id'].transform('size') == 100
df = df_answers[mask1 & mask2]
If performance is important and large data better is avoid filter
function, because slow, check timings.
So if want filter again:
df_answers_del = df_answers[df_answers['deleted'] == 0]
mask = df_answers_del.groupby('user_id')['user_id'].transform('size') == 100
df_answers_del_completed = df_answers_del[mask]
Upvotes: 2
Reputation: 88305
If you want to see which have answered all 100 questions, why are you using a >
sign when filtering? Do:
df_answers_del_completed = (df_answers_del.groupby('user_id')
.filter(lambda x: len(x) == 100))
Upvotes: 3