Reputation: 960
I am using pandas library to read data from a .json file. I have the following example dataset:
import pandas as pd
df = pd.DataFrame([['1', 'book1'],
['1', 'book2'],
['1', 'book6'],
['2', 'book1'],
['2', 'book6'],
['2', 'book3'],
['3', 'book4'],
['3', 'book2'],
['4', 'book3'],
],
columns=['visitor_id', 'bookid'])
This dataset shows the user_id and which book this user has read. My question is, is it possible to filter out the list of books other users have read based on a given book value? For example, which other books other users have read which also read book 1. So the result should have 'book6', 'book3' and 'book2' if I were to search for 'book1'.
Is it possible to do it directly using any pandas function? The closest I could find in the documentation was groupby() function but I could not get it to work. I want to avoid going through each value using a loop as my dataset is quite big.
Upvotes: 1
Views: 352
Reputation: 862511
You can filter all visitor_id
where is match mask m
for compare by Series.eq
, then filter original visitor_id
with Series.isin
and last filter all groups without m
mask by inverted it by ~
to Series, then to unique list:
m = df['bookid'].eq('book1')
a = df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'].unique().tolist()
print (a)
['book2', 'book6', 'book3']
Or you can use GroupBy.transform
with GroupBy.any
for test at least one match per groups:
m = df['bookid'].eq('book1')
a = df.loc[m.groupby(df['visitor_id']).transform('any') & ~m, 'bookid'].unique().tolist()
print (a)
['book2', 'book6', 'book3']
Explanation:
First compare values to boolean mask:
m = df['bookid'].eq('book1')
print (m)
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
Name: bookid, dtype: bool
Then filter visitor_id
it is combination boolean indexing
:
print (df[m])
visitor_id bookid
0 1 book1
3 2 book1
with DataFrame.loc
for filter by mask and column name, here visitor_id
to Series
:
print (df.loc[m, 'visitor_id'])
0 1
3 2
Name: visitor_id, dtype: object
Now filter again by values from Series
for get all rows by groups:
print (df['visitor_id'].isin(df.loc[m, 'visitor_id']))
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
Name: visitor_id, dtype: bool
But because want remove rows with mask m
, first invert mask for True->False, False->True mapping:
print (~m)
0 False
1 True
2 True
3 False
4 True
5 True
6 True
7 True
8 True
Name: bookid, dtype: bool
And chain by &
for bitwise AND
:
print (df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m)
0 False
1 True
2 True
3 False
4 True
5 True
6 False
7 False
8 False
dtype: bool bookid
This mask is used for filter bookid
column:
print (df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'])
1 book2
2 book6
4 book6
5 book3
Name: bookid, dtype: object
And last is convert Series
to unique array and to list:
print (df.loc[df['visitor_id'].isin(df.loc[m, 'visitor_id']) & ~m, 'bookid'].unique().tolist())
['book2', 'book6', 'book3']
Upvotes: 1