Reputation: 81
I have a pandas DataFrame with pairs of names in 'name_x' and 'name_y' columns and an associated id:
id name_x name_y
0 104 molly james
1 104 james molly
2 104 sarah adam
3 236 molly adam
4 388 adam sarah
5 388 johnny pete
6 104 adam sarah
7 236 adam james
8 236 pete johnny
I would like to remove 'duplicate' rows where the id numbers are the same and both names have appeared together in either name column. eg.
Such that the row with index 1 is removed because the pair of names 'molly' and 'james' have already appeared with id 104. Similarly the row with index 6 is removed as the pair of names 'adam' and 'sarah' have already appeared with id 104 so that the DataFrame looks like this:
id name_x name_y
0 104 molly james
1 104 sarah adam
2 236 molly adam
3 388 adam sarah
4 388 johnny pete
5 236 adam james
6 236 pete johnny
(The ordering of the names does not matter)
I would then like to be able to create another DataFrame which displays the count of pairs of names depending on how many times they appear with different id's and those id's eg:
count ids name_x name_y
0 1 104 molly james
1 2 [104, 388] sarah adam
2 1 236 molly adam
3 2 [388, 236] johnny pete
4 1 236 adam james
I am new to programming/python/pandas and have yet to find an answer for this! Thanks!
Upvotes: 1
Views: 584
Reputation: 862741
You can use:
names
set
s and then to list
slist
s by len
mask
with indexing with str for scalar for one item list
sdf[['name_x','name_y']] = np.sort( df[['name_x','name_y']], axis=1)
df=df.groupby(['name_x','name_y'])['id'].apply(lambda x:list(set(x))).reset_index(name='ids')
df['count'] = df['ids'].str.len()
print (df)
name_x name_y ids count
0 adam james [236] 1
1 adam molly [236] 1
2 adam sarah [104, 388] 2
3 james molly [104] 1
4 johnny pete [388, 236] 2
df['ids'] = df['ids'].mask(df['count'] == 1, df['ids'].str[0])
print (df)
name_x name_y ids count
0 adam james 236 1
1 adam molly 236 1
2 adam sarah [104, 388] 2
3 james molly 104 1
4 johnny pete [388, 236] 2
Upvotes: 1