Martin Thoma

Reputation: 136267

How can I add the IDs of duplicates to each of the elements?

I want to add a column _duplicate_list that contains, for each row, the countries of all rows in the same group. I can find the duplicates with a loop (which could probably be nicer).

What I cannot get to work is assigning the same list to several rows at once.

Example

import pandas as pd
import numpy as np

def example_df():
    """Create an example dataframe."""
    country_names = ['Germany',
                     'France',
                     'Indonesia',
                     'Ireland',
                     'Spain',
                     'Vatican']
    group = [1, 1, 0, 1, 1, 1]
    df = pd.DataFrame({'country': country_names,
                       'group': group})
    df = df[['country', 'group']]
    return df

df = example_df()
df['_duplicate_list'] = np.empty((len(df), 0)).tolist()

# This needs to be changed
for group_val in df['group'].unique().tolist():
    df.loc[df['group'] == group_val, ['_duplicate_list']] = df['country'][df['group'] == group_val].tolist()

Actual output:

     country  group _duplicate_list
0    Germany      1         Germany
1     France      1          France
2  Indonesia      0       Indonesia
3    Ireland      1         Ireland
4      Spain      1           Spain
5    Vatican      1         Vatican

Desired output:

     country  group _duplicate_list
0    Germany      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
1     France      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
2  Indonesia      0  ['Indonesia']
3    Ireland      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
4      Spain      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
5    Vatican      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']

Upvotes: 0

Views: 80

Answers (3)

MaxU - stand with Ukraine

Reputation: 210832

In [66]: df["_duplicate_list"] = \
             df["group"].map(df.groupby("group")["country"].apply(list))

In [67]: df
Out[67]:
     country  group                          _duplicate_list
0    Germany      1  [Germany, France, Ireland, Spain, Va...
1     France      1  [Germany, France, Ireland, Spain, Va...
2  Indonesia      0                              [Indonesia]
3    Ireland      1  [Germany, France, Ireland, Spain, Va...
4      Spain      1  [Germany, France, Ireland, Spain, Va...
5    Vatican      1  [Germany, France, Ireland, Spain, Va...
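For reference, here is the same idea as a self-contained script. Judging by the actual output in the question, the loop there hands pandas a list with one element per selected row, and the values are spread over the rows element by element instead of each row receiving the whole list. Building one list per group first and then looking it up by group value avoids that. The name countries_by_group is only illustrative, and agg(list) is used here in place of apply(list); this is a sketch, not the only way to write it.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "France", "Indonesia",
                "Ireland", "Spain", "Vatican"],
    "group": [1, 1, 0, 1, 1, 1],
})

# One list of countries per group value, indexed by the group value.
countries_by_group = df.groupby("group")["country"].agg(list)

# Look up each row's group value in that Series, so every row
# receives the full list for its group.
df["_duplicate_list"] = df["group"].map(countries_by_group)

print(df.to_string())
```

Because the lookup happens once per row against a precomputed Series, the grouping itself runs only once rather than once per group value or per row.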

Upvotes: 3

iamklaus

Reputation: 3770

df['duplicate_list'] = df.apply(lambda x: df[df['group'] == x.group]['country'].tolist(), axis=1)

or

df['duplicate_list'] = df.apply(lambda x: list(filter(None, np.where(df['group'] == x.group, df['country'], None))), axis=1)

Output

     country  group _duplicate_list  \
0    Germany      1         Germany   
1     France      1          France   
2  Indonesia      0       Indonesia   
3    Ireland      1         Ireland   
4      Spain      1           Spain   
5    Vatican      1         Vatican   

                               duplicate_list  
0  [Germany, France, Ireland, Spain, Vatican]  
1  [Germany, France, Ireland, Spain, Vatican]  
2                                 [Indonesia]  
3  [Germany, France, Ireland, Spain, Vatican]  
4  [Germany, France, Ireland, Spain, Vatican]  
5  [Germany, France, Ireland, Spain, Vatican]  

Upvotes: 0

Andy L.

Reputation: 25239

You could use transform with unique:

df['_duplicate_list'] = df.groupby('group').country.transform('unique')

Out[810]:
     country  group                             _duplicate_list
0    Germany      1  [Germany, France, Ireland, Spain, Vatican]
1     France      1  [Germany, France, Ireland, Spain, Vatican]
2  Indonesia      0                                 [Indonesia]
3    Ireland      1  [Germany, France, Ireland, Spain, Vatican]
4      Spain      1  [Germany, France, Ireland, Spain, Vatican]
5    Vatican      1  [Germany, France, Ireland, Spain, Vatican]
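Two caveats, as far as I can tell: unique drops values that repeat within a group, and the cells may hold NumPy arrays rather than Python lists. If plain lists with every row's country are wanted, one alternative is to aggregate per group and join the result back on the group column. This is a minimal sketch; the names lists and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "France", "Indonesia",
                "Ireland", "Spain", "Vatican"],
    "group": [1, 1, 0, 1, 1, 1],
})

# Aggregate to one Python list per group and name the Series
# after the column it should become.
lists = df.groupby("group")["country"].agg(list).rename("_duplicate_list")

# Join on the 'group' column; the original row index is kept.
result = df.join(lists, on="group")

print(result.to_string())
```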

Upvotes: 2
