Pandas, search for results in another column based on condition

Question

I am using pandas library to read data from a .json file. I have the following example dataset:

import pandas as pd

df = pd.DataFrame([['1', 'book1'],
                   ['1', 'book2'],
                   ['1', 'book6'],
                   ['2', 'book1'],
                   ['2', 'book6'],
                   ['2', 'book3'],
                   ['3', 'book4'],
                   ['3', 'book2'],
                   ['4', 'book3'],
                  ],
                  columns=['visitor_id', 'bookid'])

This dataset shows the user_id and which book this user has read. My question is, is it possible to filter out the list of books other users have read based on a given book value? For example, which other books other users have read which also read book 1. So the result should have 'book6', 'book3' and 'book2' if I were to search for 'book1'.

Is it possible to do it directly using any pandas function? The closest I could find in the documentation was groupby() function but I could not get it to work. I want to avoid going through each value using a loop as my dataset is quite big.

jezrael · Accepted Answer

You can filter all visitor_id where is match mask m for compare by Series.eq, then filter original visitor_id with Series.isin and last filter all groups without m mask by inverted it by ~ to Series, then to unique list:

m = df['bookid'].eq('book1')
a = df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'].unique().tolist()
print (a)
['book2', 'book6', 'book3']

Or you can use GroupBy.transform with GroupBy.any for test at least one match per groups:

m = df['bookid'].eq('book1')
a = df.loc[m.groupby(df['visitor_id']).transform('any')  & ~m, 'bookid'].unique().tolist()
print (a)
['book2', 'book6', 'book3']

Explanation:

First compare values to boolean mask:

m = df['bookid'].eq('book1') 
print (m)
0     True
1    False
2    False
3     True
4    False
5    False
6    False
7    False
8    False
Name: bookid, dtype: bool

Then filter visitor_id it is combination boolean indexing:

print (df[m])
  visitor_id bookid
0          1  book1
3          2  book1

with DataFrame.loc for filter by mask and column name, here visitor_id to Series:

print (df.loc[m, 'visitor_id'])
0    1
3    2
Name: visitor_id, dtype: object

Now filter again by values from Series for get all rows by groups:

print (df['visitor_id'].isin(df.loc[m, 'visitor_id']))
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7    False
8    False
Name: visitor_id, dtype: bool

But because want remove rows with mask m, first invert mask for True->False, False->True mapping:

print (~m)
0    False
1     True
2     True
3    False
4     True
5     True
6     True
7     True
8     True
Name: bookid, dtype: bool

And chain by & for bitwise AND:

print (df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m)
0    False
1     True
2     True
3    False
4     True
5     True
6    False
7    False
8    False
dtype: bool bookid

This mask is used for filter bookid column:

print (df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'])
1    book2
2    book6
4    book6
5    book3
Name: bookid, dtype: object

And last is convert Series to unique array and to list:

print (df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'].unique().tolist())
['book2', 'book6', 'book3']

Pandas, search for results in another column based on condition

Answers (1)

Related Questions