Remove rows from pandas DataFrame if multiple columns contain the same data, but interchanged

Question

I have a pandas DataFrame with pairs of names in 'name_x' and 'name_y' columns and an associated id:

    id  name_x  name_y
0   104 molly   james
1   104 james   molly
2   104 sarah   adam
3   236 molly   adam
4   388 adam    sarah
5   388 johnny  pete
6   104 adam    sarah
7   236 adam    james
8   236 pete    johnny

I would like to remove 'duplicate' rows where the id numbers are the same and both names have appeared together in either name column. eg.

Such that the row with index 1 is removed because the pair of names 'molly' and 'james' have already appeared with id 104. Similarly the row with index 6 is removed as the pair of names 'adam' and 'sarah' have already appeared with id 104 so that the DataFrame looks like this:

    id  name_x  name_y
0   104 molly   james
1   104 sarah   adam
2   236 molly   adam
3   388 adam    sarah
4   388 johnny  pete
5   236 adam    james
6   236 pete    johnny

(The ordering of the names does not matter)

I would then like to be able to create another DataFrame which displays the count of pairs of names depending on how many times they appear with different id's and those id's eg:

    count   ids        name_x   name_y
0   1       104        molly    james
1   2       [104, 388] sarah    adam
2   1       236        molly    adam
3   2       [388, 236] johnny   pete
4   1       236        adam     james

I am new to programming/python/pandas and have yet to find an answer for this! Thanks!

jezrael · Accepted Answer

You can use:

first sort columns with names
groupby, convert to sets and then to lists
get length of lists by len
last if necessary use mask with indexing with str for scalar for one item lists

df[['name_x','name_y']] = np.sort( df[['name_x','name_y']], axis=1)

df=df.groupby(['name_x','name_y'])['id'].apply(lambda x:list(set(x))).reset_index(name='ids')
df['count'] = df['ids'].str.len()
print (df)
   name_x name_y         ids  count
0    adam  james       [236]      1
1    adam  molly       [236]      1
2    adam  sarah  [104, 388]      2
3   james  molly       [104]      1
4  johnny   pete  [388, 236]      2

df['ids'] = df['ids'].mask(df['count'] == 1, df['ids'].str[0])
print (df)
   name_x name_y         ids  count
0    adam  james         236      1
1    adam  molly         236      1
2    adam  sarah  [104, 388]      2
3   james  molly         104      1
4  johnny   pete  [388, 236]      2

Remove rows from pandas DataFrame if multiple columns contain the same data, but interchanged

Answers (1)

Related Questions