Amenhotep
Amenhotep

Reputation: 919

How to label duplicate groups in pandas?

I have a DataFrame:

>>> df
     A
0  foo
1  bar
2  foo
3  baz
4  foo
5  bar

I need to find all the duplicate groups and label them with sequential dgroup_id's:

>>> df
     A  dgroup_id
0  foo          1
1  bar          2
2  foo          1
3  baz
4  foo          1
5  bar          2

(This means that foo belongs to the first group of duplicates, bar to the second group of duplicates, and baz is not duplicated.)

I did this:

import pandas as pd

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})

duplicates = df.groupby('A').size()
duplicates = duplicates[duplicates>1]
# Yes, this is ugly, but I didn't know how to do it otherwise:
duplicates[duplicates.reset_index().index] = duplicates.reset_index().index
df.insert(1, 'dgroup_id', df['A'].map(duplicates))

This leads to:

>>> df
     A  dgroup_id
0  foo        1.0
1  bar        0.0
2  foo        1.0
3  baz        NaN
4  foo        1.0
5  bar        0.0

Is there a simpler/shorter way to achieve this in pandas? I read that maybe pandas.factorize could be of help here, but I don't know how to use it... (the pandas documentation on this function is of no help)

Also: I don't mind neither the 0-based group count, nor the weird sorting order; but I would like to have the dgroup_id's as ints, not floats.

Upvotes: 2

Views: 3334

Answers (5)

Jinyao
Jinyao

Reputation: 1

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
key_set = set(df['A'])
df_a = pd.DataFrame(list(key_set))
df_a['dgroup_id'] = df_a.index
result = pd.merge(df,df_a,left_on='A',right_on=0,how='left')

In [32]: result.drop(0,axis=1)
Out[32]:
     A  dgroup_id
0  foo        2
1  bar        0
2  foo        2
3  baz        1
4  foo        2
5  bar        0

Upvotes: 0

piRSquared
piRSquared

Reputation: 294278

Use duplicated to identify where dups are. Use where to replace singletons with ''. Use categorical to factorize.

dups = df.A.duplicated(keep=False)
df.assign(dgroup_id=df.A.where(dups, '').astype('category').cat.codes)

     A  dgroup_id
0  foo          2
1  bar          1
2  foo          2
3  baz          0
4  foo          2
5  bar          1

If you insist on the zeros being ''

dups = df.A.duplicated(keep=False)
df.assign(
    dgroup_id=df.A.where(dups, '').astype('category').cat.codes.replace(0, ''))

     A dgroup_id
0  foo         2
1  bar         1
2  foo         2
3  baz          
4  foo         2
5  bar         1

Upvotes: 1

Deena
Deena

Reputation: 6213

You could go for:

import pandas as pd
import numpy as np
df = pd.DataFrame(['foo', 'bar', 'foo', 'baz', 'foo', 'bar',], columns=['name'])

# Create the groups order
ordered_names = df['name'].drop_duplicates().tolist()   # ['foo', 'bar', 'baz']

# Find index of each element in the ordered list
df['duplication_index'] = df['name'].apply(lambda x: ordered_names.index(x) + 1)

# Discard non-duplicated entries
df.loc[~df['name'].duplicated(keep=False), 'duplication_index'] = np.nan

print(df)
#   name  duplication_index
# 0  foo                1.0
# 1  bar                2.0
# 2  foo                1.0
# 3  baz                NaN
# 4  foo                1.0
# 5  bar                2.0

Upvotes: 0

Allen Qin
Allen Qin

Reputation: 19947

Use chained operation to first get value_count for each A, calculate the sequence number for each group, and then join back to the original DF.

(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame(), 
             left_on='A', right_index=True).sort_index()
)
Out[49]: 
     A  dgroup_id
0  foo        1.0
1  bar        2.0
2  foo        1.0
3  baz        NaN
4  foo        1.0
5  bar        2.0

If you need Nan for unique groups, you can't have int as the datatype which is a pandas limitation at the moment. If you are ok with set 0 for unique groups, you can do something like:

(
    pd.merge(df,
             df.A.value_counts().apply(lambda x: 1 if x>1 else np.nan)
               .cumsum().rename('dgroup_id').to_frame().fillna(0).astype(int), 
             left_on='A', right_index=True).sort_index()
)

     A  dgroup_id
0  foo          1
1  bar          2
2  foo          1
3  baz          0
4  foo          1
5  bar          2

Upvotes: 1

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

You can make a list of duplicates by get_duplicates() then set the dgroup_id by A's index

def find_index(string):
    if string in duplicates:
        return duplicates.index(string)+1
    else:
        return 0

df = pd.DataFrame({'A': ('foo', 'bar', 'foo', 'baz', 'foo', 'bar')})
duplicates = df.set_index('A').index.get_duplicates()
df['dgroup_id'] = df['A'].apply(find_index)
df

Output:

     A  dgroup_id
0  foo          2
1  bar          1
2  foo          2
3  baz          0
4  foo          2
5  bar          1
​

Upvotes: 4

Related Questions