Reputation: 392
I need to delete duplicated rows based on the combination of two columns (person1 and person2), both of which hold strings. For example, person1: ryan with person2: delta and person1: delta with person2: ryan are the same pair and have the same value in the messages column, so I need to drop one of these two rows. The non-duplicated rows should be returned as well.
Code to recreate df
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
df
person1 person2 messages
0 0 ryan delta 1
1 1 delta ryan 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 bravo delta 3
5 5 alpha ryan 9
6 6 ryan alpha 9
Answer df should be:
finaldf
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Upvotes: 6
Views: 2479
Reputation: 14109
Try as follows:
res = (df[~df.filter(like='person').apply(frozenset, axis=1).duplicated()]
.reset_index(drop=True))
print(res)
person1 person2 messages
0 0 ryan delta 1
1 2 delta alpha 2
2 3 delta bravo 3
3 5 alpha ryan 9
Explanation
First, use df.filter to select just the columns with person*.
Then, use df.apply to turn each row (axis=1) into a frozenset. So, at this stage, we are looking at a pd.Series like this:
0 (ryan, delta)
1 (ryan, delta)
2 (alpha, delta)
3 (bravo, delta)
4 (bravo, delta)
5 (alpha, ryan)
6 (alpha, ryan)
dtype: object
Now, apply Series.duplicated and add ~ as a prefix to the resulting boolean series to select the inverse from the original df.
Finally, reset the index with df.reset_index.
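To make the chain easier to inspect, the steps in the explanation can also be run one at a time. This is only a sketch: the intermediate names (people, pair_keys, is_repeat) are made up for illustration, and the unnamed "" column from the question is left out for brevity.

```python
import pandas as pd

# Same data as in the question, minus the unnamed "" column (omitted for brevity).
df = pd.DataFrame({
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Step 1: keep only the columns whose name contains "person".
people = df.filter(like="person")

# Step 2: collapse each row into an order-insensitive, hashable key.
pair_keys = people.apply(frozenset, axis=1)

# Step 3: flag every repeat of a key after its first occurrence.
is_repeat = pair_keys.duplicated()

# Step 4: keep the rows that are not repeats and renumber the index.
res = df[~is_repeat].reset_index(drop=True)
print(res)
```

One thing to keep in mind: a frozenset drops repeated elements, so a row where person1 and person2 are the same name produces a one-element key. That is usually harmless here, but worth knowing if such rows can occur.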
Upvotes: 5
Reputation: 405745
You can put the two person columns in order within each row, then drop duplicates.
import pandas as pd
df = pd.DataFrame({"": [0,1,2,3,4,5,6],
"person1": ["ryan", "delta", "delta", "delta","bravo","alpha","ryan"],
"person2": ["delta", "ryan", "alpha", "bravo","delta","ryan","alpha"],
"messages": [1, 1, 2, 3,3,9,9]})
print(df)
swap = df['person1'] < df['person2']
df.loc[swap, ['person1', 'person2']] = df.loc[swap, ['person2', 'person1']].values
df = df.drop_duplicates(subset=['person1', 'person2'])
print(df)
After the swap:
person1 person2 messages
0 0 ryan delta 1
1 1 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
4 4 delta bravo 3
5 5 ryan alpha 9
6 6 ryan alpha 9
After dropping duplicates:
person1 person2 messages
0 0 ryan delta 1
2 2 delta alpha 2
3 3 delta bravo 3
5 5 ryan alpha 9
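Note that the swap rewrites person1 and person2 in place, so row 5 comes back as ryan/alpha rather than the alpha/ryan shown in the question. If the original order matters, one possible variant (a sketch, not part of the answer above) is to sort the names into a separate key and use it only to decide which rows to keep. The names keys and deduped are illustrative, and the unnamed "" column is again left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "person1": ["ryan", "delta", "delta", "delta", "bravo", "alpha", "ryan"],
    "person2": ["delta", "ryan", "alpha", "bravo", "delta", "ryan", "alpha"],
    "messages": [1, 1, 2, 3, 3, 9, 9],
})

# Sort the two names within each row to get an order-independent key,
# leaving the original columns untouched.
pairs = df[["person1", "person2"]].to_numpy(dtype=object)
keys = pd.DataFrame(np.sort(pairs, axis=1), index=df.index)

# Keep the first row seen for each sorted pair.
deduped = df[~keys.duplicated()]
print(deduped)
```

Because df itself is never modified, the surviving rows keep their original person1/person2 order and their original index labels.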
Upvotes: 2
Reputation: 4548
Here's a less general approach than the one given by @ouroboros1. It only works for your two-column case.
#make a Series of strings of min of p1/p2 concat to max of p1/p2
sorted_p1p2 = df[['person1','person2']].min(axis=1)+'_'+df[['person1','person2']].max(axis=1)
#subset to non-dup from the Series
dedup_df = df[~sorted_p1p2.duplicated()]
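One caveat with a string key: if a name can itself contain the separator, two different pairs may end up with the same key. The sketch below uses made-up names (not from the question) to show the collision, and compares it with a tuple key, which keeps the two names separate. Both keys are built with plain Python sorting here rather than min/max, purely for illustration.

```python
import pandas as pd

# Hypothetical names that contain the "_" separator.
df = pd.DataFrame({
    "person1": ["a_b", "a", "c"],
    "person2": ["c", "b_c", "a_b"],
})

name_pairs = list(zip(df["person1"], df["person2"]))

# Joined-string key: rows 0 and 1 are different pairs but both become "a_b_c".
concat_keys = pd.Series(["_".join(sorted(p)) for p in name_pairs], index=df.index)

# Tuple key: only row 2, the true reverse of row 0, matches an earlier row.
tuple_keys = pd.Series([tuple(sorted(p)) for p in name_pairs], index=df.index)

print(df[~concat_keys.duplicated()])  # keeps only row 0
print(df[~tuple_keys.duplicated()])   # keeps rows 0 and 1
```

For the names in the question this does not matter, since none of them contains an underscore.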
Upvotes: 2