Reputation: 919
I have a DataFrame:
>>> df
A
0 foo
1 bar
2 foo
3 baz
4 foo
5 bar
I need to find all the duplicate groups and label them with sequential dgroup_id
's:
>>> df
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz
4 foo 1
5 bar 2
(This means that foo
belongs to the first group of duplicates, bar
to the second group of duplicates, and baz
is not duplicated.)
I did this:
import pandas as pd
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))
This leads to:
>>> df
A dgroup_id
0 foo 1.0
1 bar 0.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 0.0
Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize
could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)
Also: I don't mind neither the 0-based group count, nor the weird sorting order; but I would like to have the dgroup_id
's as ints, not floats.
Upvotes: 2
Views: 3334
Reputation: 1
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')
In [32]: result.drop(0,axis=1)
Out[32]:
A dgroup_id
0 foo 2
1 bar 0
2 foo 2
3 baz 1
4 foo 2
5 bar 0
Upvotes: 0
Reputation: 294278
Use duplicated
to identify where dups are. Use where
to replace singletons with ''
. Use categorical to factorize.
dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz 0
4 foo 2
5 bar 1
If you insist on the zeros being ''
dups = df.A.duplicated(keep=False)
df.assign(
dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))
A dgroup_id
0 foo 2
1 bar 1
2 foo 2
3 baz
4 foo 2
5 bar 1
Upvotes: 1
Reputation: 6213
You could go for:
import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])
# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist() # ['foo', 'bar', 'baz']
# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)
# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan
print(df)
# name duplication_index
# 0 foo 1.0
# 1 bar 2.0
# 2 foo 1.0
# 3 baz NaN
# 4 foo 1.0
# 5 bar 2.0
Upvotes: 0
Reputation: 19947
Use chained operation to first get value_count for each A, calculate the sequence number for each group, and then join back to the original DF.
(
pd.merge(df,
df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
.cumsum().rename('dgroup_id').to_frame(),
left_on='A', right_index=True).sort_index()
)
Out[49]:
A dgroup_id
0 foo 1.0
1 bar 2.0
2 foo 1.0
3 baz NaN
4 foo 1.0
5 bar 2.0
If you need Nan for unique groups, you can't have int as the datatype which is a pandas limitation at the moment. If you are ok with set 0 for unique groups, you can do something like:
(
pd.merge(df,
df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
.cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int),
left_on='A', right_index=True).sort_index()
)
A dgroup_id
0 foo 1
1 bar 2
2 foo 1
3 baz 0
4 foo 1
5 bar 2
Upvotes: 1
Reputation: 30605
You can make a list
of duplicates by get_duplicates()
then set the dgroup_id
by A
's index
def find_index(string):
if string in duplicates:
return duplicates.index(string)+1
else:
return 0
df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df
Output:
A dgroup_id 0 foo 2 1 bar 1 2 foo 2 3 baz 0 4 foo 2 5 bar 1
Upvotes: 4