Reputation: 593

How to make a dummy column determining whether a column cell value is a duplicate?

My data is travel data.

I want to create a column df['user_type'] that indicates whether each df['user_id'] occurs more than once. If it does, I'll label that user as a frequent user.

Here is my code below, but it takes way too long:

#Column that determines user type
def determine_user_type(val):
  df_freq = df[df['user_id'].duplicated()]

  user_type = ""
  if(val in df_freq['user_id'].values):
    user_type = "Frequent"
  else:
    user_type = "Single"

  return user_type

df['user_type'] = df['user_id'].apply(lambda x: determine_user_type(x))

Upvotes: 3

Answers (3)

jezrael

Reputation: 863166

Use numpy.where with duplicated; to flag all duplicates, including the first occurrence, add the parameter keep=False:

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id':list('aaacbbt')})

df['user_type'] = np.where(df['user_id'].duplicated(keep=False), 'Frequent','Single')

Alternative:

d = {True:'Frequent',False:'Single'}
df['user_type'] = df['user_id'].duplicated(keep=False).map(d)

print (df)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single

EDIT:

df = pd.DataFrame({'user_id':list('aaacbbt')})
print (df)
  user_id
0       a
1       a
2       a
3       c
4       b
5       b
6       t

Here drop_duplicates removes all duplicate rows by column user_id and returns only the first row for each value (the default parameter is keep='first'):

df_single = df.drop_duplicates('user_id')
print (df_single)
  user_id
0       a
3       c
4       b
6       t

But Series.duplicated returns True for every duplicate except the first occurrence:

print (df['user_id'].duplicated())
0    False
1     True
2     True
3    False
4    False
5     True
6    False
Name: user_id, dtype: bool

df_freq = df[df['user_id'].duplicated()]
print (df_freq)
  user_id
1       a
2       a
5       b
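
To tie this back to the question's code, here is a minimal sketch, assuming the same sample data. The values in df_freq are exactly the user_ids that occur more than once, so they can be computed once and tested with a single vectorized isin call instead of a per-row apply. The isin step and the name freq_ids are illustrative and not part of the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Compute the repeated ids once, outside any per-row function
freq_ids = df.loc[df['user_id'].duplicated(), 'user_id'].unique()

# One vectorized membership test replaces the row-by-row apply
df['user_type'] = np.where(df['user_id'].isin(freq_ids), 'Frequent', 'Single')

# Labels produced directly by duplicated(keep=False), for comparison
expected = np.where(df['user_id'].duplicated(keep=False), 'Frequent', 'Single')
print(df)
```

Both routes should give the same labels; duplicated(keep=False) simply skips the intermediate step.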

Upvotes: 4

BENY

Reputation: 323326

Using the data from jezrael's answer, here is a method involving value_counts:

df.user_id.map(df.user_id.value_counts().gt(1).replace({True:'Frequent',False:'Single'}))
Out[52]: 
0    Frequent
1    Frequent
2    Frequent
3      Single
4    Frequent
5    Frequent
6      Single
Name: user_id, dtype: object
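
One possible advantage of the value_counts route is that the cutoff is easy to change. A small sketch, assuming a hypothetical min_trips threshold that is not in the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

min_trips = 2  # hypothetical cutoff; 2 reproduces "occurs more than once"

# Map each row's user_id to the number of times it appears in the column
counts = df['user_id'].map(df['user_id'].value_counts())

# Label rows whose id meets the cutoff
df['user_type'] = np.where(counts >= min_trips, 'Frequent', 'Single')
print(df)
```

With min_trips = 3, only user a would be labelled Frequent in this sample.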

Upvotes: 2

piRSquared

Reputation: 294488

Using jezrael's data

df = pd.DataFrame({'user_id':list('aaacbbt')})

You can use NumPy integer-array indexing: the boolean mask is cast to 0/1 and used to pick from a two-element label array.

df.assign(
    user_type=
    np.array(['Single', 'Frequent'])[
        df['user_id'].duplicated(keep=False).astype(int)
    ]
)

  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single

Upvotes: 2
