Reputation: 136267
I want to add a column _duplicate_list which contains, for each row, all countries that share that row's group value. I can get the duplicates with a looping solution (which could probably be nicer).
What I cannot get to work easily is assigning the same list to several rows.
import pandas as pd
import numpy as np
def example_df():
    """Create an example dataframe."""
    country_names = ['Germany',
                     'France',
                     'Indonesia',
                     'Ireland',
                     'Spain',
                     'Vatican']
    group = [1, 1, 0, 1, 1, 1]
    df = pd.DataFrame({'country': country_names,
                       'group': group})
    df = df[['country', 'group']]
    return df
df = example_df()
df['_duplicate_list'] = np.empty((len(df), 0)).tolist()
# This needs to be changed
for group_val in df['group'].unique().tolist():
    df.loc[df['group'] == group_val, ['_duplicate_list']] = df['country'][df['group'] == group_val].tolist()
Actual output:
     country  group _duplicate_list
0    Germany      1         Germany
1     France      1          France
2  Indonesia      0       Indonesia
3    Ireland      1         Ireland
4      Spain      1           Spain
5    Vatican      1         Vatican
Desired output:
     country  group  _duplicate_list
0    Germany      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
1     France      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
2  Indonesia      0  ['Indonesia']
3    Ireland      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
4      Spain      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
5    Vatican      1  ['Germany', 'France', 'Ireland', 'Spain', 'Vatican']
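The loop above ends up with one string per row because pandas takes the list on the right-hand side as one value per selected row, so its items are spread over the rows instead of being stored whole in each cell. As a minimal sketch of a loop-based variant that yields the desired column (the dict name countries_by_group is only illustrative), one could collect the lists first and assign the whole column in one step:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "France", "Indonesia", "Ireland", "Spain", "Vatican"],
    "group": [1, 1, 0, 1, 1, 1],
})

# Collect the countries of each group in a plain dict: group value -> list.
countries_by_group = {}
for country, group_val in zip(df["country"], df["group"]):
    countries_by_group.setdefault(group_val, []).append(country)

# Build one list per row and wrap it in a Series, so that each list is
# stored as a single cell rather than being spread across rows.
df["_duplicate_list"] = pd.Series(
    [countries_by_group[g] for g in df["group"]], index=df.index
)

print(df)
```

This keeps the explicit loop but avoids assigning a list through df.loc with a boolean mask.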
Upvotes: 0
Views: 80
Reputation: 210832
In [66]: df["_duplicate_list"] = \
    ...:     df["group"].map(df.groupby("group")["country"].apply(list))
In [67]: df
Out[67]:
     country  group                         _duplicate_list
0    Germany      1  [Germany, France, Ireland, Spain, Va...
1     France      1  [Germany, France, Ireland, Spain, Va...
2  Indonesia      0                              [Indonesia]
3    Ireland      1  [Germany, France, Ireland, Spain, Va...
4      Spain      1  [Germany, France, Ireland, Spain, Va...
5    Vatican      1  [Germany, France, Ireland, Spain, Va...
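The one-liner above does two things: it builds one list of countries per group, then looks up each row's group in that result. A self-contained sketch with the two steps separated, using the example data from the question (the intermediate name countries_by_group is an assumption for readability), might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Germany", "France", "Indonesia", "Ireland", "Spain", "Vatican"],
    "group": [1, 1, 0, 1, 1, 1],
})

# Step 1: one list of countries per group value, indexed by group.
countries_by_group = df.groupby("group")["country"].apply(list)

# Step 2: look up each row's group value in that Series, so every row
# of a group receives that group's list.
df["_duplicate_list"] = df["group"].map(countries_by_group)

print(df)
```

Because the lookup goes through map, no explicit Python loop over the groups is needed.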
Upvotes: 3
Reputation: 3770
df['duplicate_list'] = df.apply(lambda x: df[df['group'] == x.group]['country'].tolist(), axis=1)
or
df['duplicate_list'] = df.apply(lambda x: list(filter(None, np.where(df['group'] == x.group, df['country'], None))), axis=1)
Output:
     country  group _duplicate_list  \
0    Germany      1         Germany
1     France      1          France
2  Indonesia      0       Indonesia
3    Ireland      1         Ireland
4      Spain      1           Spain
5    Vatican      1         Vatican

                               duplicate_list
0  [Germany, France, Ireland, Spain, Vatican]
1  [Germany, France, Ireland, Spain, Vatican]
2                                 [Indonesia]
3  [Germany, France, Ireland, Spain, Vatican]
4  [Germany, France, Ireland, Spain, Vatican]
5  [Germany, France, Ireland, Spain, Vatican]
Upvotes: 0
Reputation: 25239
I would use transform with unique:
df['_duplicate_list'] = df.groupby('group').country.transform('unique')
Out[810]:
     country  group                             _duplicate_list
0    Germany      1  [Germany, France, Ireland, Spain, Vatican]
1     France      1  [Germany, France, Ireland, Spain, Vatican]
2  Indonesia      0                                 [Indonesia]
3    Ireland      1  [Germany, France, Ireland, Spain, Vatican]
4      Spain      1  [Germany, France, Ireland, Spain, Vatican]
5    Vatican      1  [Germany, France, Ireland, Spain, Vatican]
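One caveat worth checking: unique drops repeated values within a group, whereas collecting into a list keeps them. For the example data both give the same result, since no country appears twice. The sketch below uses hypothetical data with a repeated country (not from the question) and goes through groupby(...).unique() plus map rather than transform, purely to show the difference:

```python
import pandas as pd

# Hypothetical data: 'Germany' appears twice in group 1.
df = pd.DataFrame({
    "country": ["Germany", "France", "Germany", "Indonesia"],
    "group": [1, 1, 1, 0],
})

grouped = df.groupby("group")["country"]

# Collecting into a list keeps every entry of the group.
with_list = df["group"].map(grouped.apply(list))

# unique() keeps each value once, in order of first appearance.
with_unique = df["group"].map(grouped.unique())

print(list(with_list.iloc[0]))    # repeated entry is kept
print(list(with_unique.iloc[0]))  # repeated entry appears once
```

If repeated names should be preserved, the list-based variant is the safer choice; if they should be collapsed, unique does that directly.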
Upvotes: 2