Reputation: 2117
I have a dataframe with 2 columns. One for users and the other for destinations. I am looking to find the destination where ALL users visited. This is the dataframe
df1
user destination
0 1 dest1
1 2 dest3
2 3 dest4
3 2 dest1
4 2 dest4
5 3 dest1
6 4 dest2
7 4 dest1
8 1 dest4
9 4 dest4
Desired Output:
dest1
dest4
I am working with a dataset with thousands of users and destinations so this must scale out.
Upvotes: 1
Views: 161
Reputation: 38415
Check for equality between the sets of users grouped by id.
users = df.groupby('destination').user.apply(lambda x: np.equal(set(df.user.unique()),set(x)))
users[users]
You get
destination
dest1 True
dest4 True
If you need the destination as array
users[users].index
Index(['dest1', 'dest4'], dtype='object', name='destination')
Upvotes: 0
Reputation: 4757
You could try to unstack() the dataframe converting the users into columns. Then drop any row with an empty column (unvisited).
df.groupby(['destination', 'user']).size().unstack().dropna().reset_index().destination
Upvotes: 2
Reputation: 61910
You could try:
# find number of unique users
nunique = df.user.nunique()
# find number of unique users by destination
destinations = df.groupby('destination').user.nunique().to_frame().reset_index()
# find the destinations that match
result = destinations[destinations.user == nunique].drop('user', axis=1)
print(result)
Output
destination
0 dest1
3 dest4
Upvotes: 2