Reputation: 593
My data is travel data. I want to create a column df['user_type'] that indicates whether each df['user_id'] occurs more than once. If it does, I'll label that user as a frequent user.
Here is my code, but it takes far too long:
#Column that determines user type
def determine_user_type(val):
    df_freq = df[df['user_id'].duplicated()]
    user_type = ""
    if(val in df_freq['user_id'].values):
        user_type = "Frequent"
    else:
        user_type = "Single"
    return user_type
df['user_type'] = df['user_id'].apply(lambda x: determine_user_type(x))
Upvotes: 3
Views: 156
Reputation: 863166
Use numpy.where with duplicated. To flag all duplicates, including the first occurrence, add the parameter keep=False:
import numpy as np
import pandas as pd
df = pd.DataFrame({'user_id':list('aaacbbt')})
df['user_type'] = np.where(df['user_id'].duplicated(keep=False), 'Frequent','Single')
Alternative:
d = {True:'Frequent',False:'Single'}
df['user_type'] = df['user_id'].duplicated(keep=False).map(d)
print (df)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
EDIT:
df = pd.DataFrame({'user_id':list('aaacbbt')})
print (df)
  user_id
0       a
1       a
2       a
3       c
4       b
5       b
6       t
Here drop_duplicates removes every row whose user_id has already appeared and keeps only the first row for each value (the default parameter is keep='first'):
df_single = df.drop_duplicates('user_id')
print (df_single)
  user_id
0       a
3       c
4       b
6       t
But Series.duplicated returns True for every duplicate except the first occurrence:
print (df['user_id'].duplicated())
0    False
1     True
2     True
3    False
4    False
5     True
6    False
Name: user_id, dtype: bool
df_freq = df[df['user_id'].duplicated()]
print (df_freq)
  user_id
1       a
2       a
5       b
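To make the difference between the default and keep=False concrete, here is a minimal sketch on the same sample data. The names mask_default and mask_all are only illustrative and do not appear in the code above.

```python
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Default keep='first': the first occurrence of each value is NOT flagged.
mask_default = df['user_id'].duplicated()

# keep=False: every row whose user_id appears more than once is flagged.
mask_all = df['user_id'].duplicated(keep=False)

print(mask_default.tolist())
print(mask_all.tolist())
```

Both masks are computed once for the whole column, whereas the function in the question recomputes duplicated() for every row, which is likely where most of the time goes.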
Upvotes: 4
Reputation: 323326
Using jezrael's data, here is a method that involves value_counts:
df.user_id.map(df.user_id.value_counts().gt(1).replace({True:'Frequent',False:'Single'}))
Out[52]:
0    Frequent
1    Frequent
2    Frequent
3      Single
4    Frequent
5    Frequent
6      Single
Name: user_id, dtype: object
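The one-liner above packs several steps together. A rough step-by-step sketch of the same idea, assuming the same sample data, could look like this (counts and is_frequent are illustrative names, and mapping the counts onto the rows first is a slight reordering of the original expression):

```python
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Count how many rows each user_id has (index: user_id, value: count).
counts = df['user_id'].value_counts()

# Look up each row's count by its user_id, then compare against 1.
is_frequent = df['user_id'].map(counts).gt(1)

# Translate the boolean mask into labels.
df['user_type'] = is_frequent.map({True: 'Frequent', False: 'Single'})
print(df)
```

One possible reason to prefer counts over duplicated is that the threshold becomes adjustable, for example gt(2) to require at least three trips.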
Upvotes: 2
Reputation: 294488
Using jezrael's data
df = pd.DataFrame({'user_id':list('aaacbbt')})
You can index a NumPy array of labels with the boolean mask cast to integers:
df.assign(
    user_type=
    np.array(['Single', 'Frequent'])[
        df['user_id'].duplicated(keep=False).astype(int)
    ]
)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
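For readers unfamiliar with this trick, here is a small sketch of what happens inside the brackets, assuming the same sample data. False becomes 0 and True becomes 1, and those integers pick positions out of the label array. The names labels, idx and user_type are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Position 0 holds the label for False, position 1 the label for True.
labels = np.array(['Single', 'Frequent'])

# Boolean mask cast to integers: False -> 0, True -> 1.
idx = df['user_id'].duplicated(keep=False).astype(int).to_numpy()

# Integer-array indexing selects one label per row.
user_type = labels[idx]
print(user_type)
```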
Upvotes: 2