Reputation: 725
Is there a way to check for, and then drop, rows that are not unique?
My data frame looks something like this:
   ID1  ID2  weight
0    2    4     0.5
1    3    7     0.8
2    4    2     0.5
3    7    3     0.8
4    8    2     0.5
5    3    8     0.5
EDIT: I added a couple more rows to show that other unique rows that may have the same weight should be kept.
When I use pandas drop_duplicates(subset=['ID1', 'ID2', 'weight'], keep=False), it considers each row individually and does not recognise that rows 0 and 2 (and likewise rows 1 and 3) hold the same values, just with ID1 and ID2 swapped.
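Here is a minimal reproduction of the frame and the call; it returns the frame unchanged, since no row is an exact column-wise duplicate of another:
import pandas as pd

df = pd.DataFrame({'ID1': [2, 3, 4, 7, 8, 3],
                   'ID2': [4, 7, 2, 3, 2, 8],
                   'weight': [0.5, 0.8, 0.5, 0.8, 0.5, 0.5]})

# (2, 4, 0.5) and (4, 2, 0.5) differ column-wise, so keep=False finds
# no exact duplicates and nothing is dropped
print(df.drop_duplicates(subset=['ID1', 'ID2', 'weight'], keep=False))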
Upvotes: 5
Views: 2568
Reputation: 71687
Sort the dataframe along axis=1, then use np.unique with the optional parameter return_index=True to get the indices of the unique rows:
import numpy as np
import pandas as pd

sub = ['ID1', 'ID2', 'weight']
# Row-wise sort makes (2, 4, 0.5) and (4, 2, 0.5) identical rows
idx = np.unique(np.sort(df[sub], axis=1), axis=0, return_index=True)[1]
df1 = df.iloc[sorted(idx)]
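(np.unique returns the indices ordered by the sorted unique rows, not by their position in the frame, so sorted(idx) restores the original row order before the iloc lookup.)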
Alternative approach suggested by @anky:
# Same idea: row-wise sort, then drop rows already seen (keeps the first of each pair)
df1 = df[~pd.DataFrame(np.sort(df[sub], axis=1), index=df.index).duplicated()]
print(df1)
   ID1  ID2  weight
0    2    4     0.5
1    3    7     0.8
4    8    2     0.5
5    3    8     0.5
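Both snippets keep the first occurrence of each pair. To match the keep=False semantics from the question and drop every member of a duplicate pair, duplicated also takes keep=False; a minimal sketch, reusing df and sub from above:
# Mark all rows whose sorted (ID1, ID2, weight) occurs more than once,
# leaving only rows 4 and 5 from the example
df2 = df[~pd.DataFrame(np.sort(df[sub], axis=1), index=df.index).duplicated(keep=False)]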
Upvotes: 4
Reputation: 5026
This works, but it's kind of hacky: create sets from the columns that form a pair, then convert them to tuples to get hashable types.
# Build an order-insensitive, hashable key for each (ID1, ID2) pair,
# then drop all rows that share a (pair, weight) combination
df['new'] = df[['ID1','ID2']].apply(lambda x: tuple(set(x)), axis=1)
df.drop_duplicates(subset=['new','weight'], keep=False)
Out:
   ID1  ID2  weight     new
4    8    2     0.5  (8, 2)
5    3    8     0.5  (8, 3)
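One caveat: set iteration order is an implementation detail, so tuple(set(x)) is not guaranteed to yield the same tuple for (2, 4) and (4, 2) on every build. A sketch of a deterministic variant using tuple(sorted(x)) instead:
# Sorting the pair makes the key deterministic: (2, 4) and (4, 2) both map to (2, 4)
df['new'] = df[['ID1','ID2']].apply(lambda x: tuple(sorted(x)), axis=1)
df.drop_duplicates(subset=['new','weight'], keep=False)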
Upvotes: 1