Reputation: 449
I have a dataframe with 3 columns in Python:
Name1 Name2 Value
Juan Ale 1
Ale Juan 1
and would like to eliminate the duplicates based on columns Name1 and Name2 combinations.
In my example both rows are equal (but they are in different order), and I would like to delete the second row and just keep the first one, so the end result should be:
Name1 Name2 Value
Juan Ale 1
Any idea will be really appreciated!
Upvotes: 22
Views: 9013
Reputation: 164623
You can convert each row to a frozenset and use pd.DataFrame.duplicated.
res = df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
print(res)
Name1 Name2 Value
0 Juan Ale 1
frozenset is necessary instead of set, since duplicated uses hashing to check for duplicates.
This approach scales better with columns than rows. For a large number of rows, use @Wen's sort-based algorithm.
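To make the hashing point concrete, here is a minimal sketch, assuming the two-row frame from the question (rebuilt here by hand). The names keys, set_is_hashable and res are only for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question (an assumption for this sketch).
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# A plain set is mutable and therefore not hashable.
try:
    hash({"Juan", "Ale"})
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# A frozenset is hashable and ignores element order, so both rows
# produce the same key.
keys = df[["Name1", "Name2"]].apply(frozenset, axis=1)

# Keep only the first row seen for each unordered pair.
res = df[~keys.duplicated()]
print(res)
```

Because the key ignores order, (Juan, Ale) and (Ale, Juan) compare equal, which is what lets duplicated flag the second row.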
Upvotes: 23
Reputation: 59274
I know I'm kind of late to this question, but here is my contribution anyway :)
You can also use get_dummies and add as a good way of creating hashable rows:
df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
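As a rough illustration of why this works, the sketch below applies the same idea to the question's Name1/Name2 columns (the one-liner above uses columns named a and b). Passing dtype=int is my own choice here so that the addition produces counts; the variable names are hypothetical.

```python
import pandas as pd

# Sample frame rebuilt from the question (an assumption for this sketch).
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# One indicator column per distinct name, for each of the two columns.
left = pd.get_dummies(df["Name1"], dtype=int)
right = pd.get_dummies(df["Name2"], dtype=int)

# Adding the indicators counts how often each name appears in a row,
# regardless of which column it came from.
combined = left.add(right, fill_value=0)

# Rows with identical counts hold the same pair of names.
res = df[~combined.duplicated().to_numpy()]
print(res)
```

Both rows end up with a count of 1 for Ale and 1 for Juan, so the second row is flagged as a duplicate of the first.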
Times are not as good as @Wen's answer, but it is still way faster than apply + frozenset:
df=pd.concat([df]*1000000)
%timeit df[~(pd.get_dummies(df.a).add(pd.get_dummies(df.b), fill_value=0)).duplicated()]
1.8 s ± 85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df[pd.DataFrame(np.sort(df[['a','b']].values,1)).duplicated()]
1.26 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df[~df[['a', 'b']].apply(frozenset, axis=1).duplicated()]
1min 9s ± 684 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 7
Reputation: 323226
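By using np.sort with duplicated. Note that the mask below selects the duplicated rows; prefix it with ~ to keep the first occurrence of each pair instead: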
df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
Out[614]:
Name1 Name2 Value
1 Ale Juan 1
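For the result the question asks for (keeping the first row of each pair), a minimal sketch of the same sort-based idea with the mask negated might look like this. The sample frame is rebuilt from the question, and converting the mask with to_numpy() is my own addition so that the filter does not depend on df's index labels.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (an assumption for this sketch).
df = pd.DataFrame({"Name1": ["Juan", "Ale"],
                   "Name2": ["Ale", "Juan"],
                   "Value": [1, 1]})

# Sort the two names within each row so that (Juan, Ale) and (Ale, Juan)
# become the same ordered pair.
sorted_pairs = np.sort(df[["Name1", "Name2"]].to_numpy(), axis=1)

# Flag every row whose sorted pair has already appeared.
is_dup = pd.DataFrame(sorted_pairs).duplicated().to_numpy()

# Negate the mask to keep the first occurrence of each pair.
res = df[~is_dup]
print(res)
```

Using a positional NumPy mask also sidesteps index alignment when df has a non-default or repeated index, as it does after pd.concat.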
Performance
df=pd.concat([df]*100000)
%timeit df[pd.DataFrame(np.sort(df[['Name1','Name2']].values,1)).duplicated()]
10 loops, best of 3: 69.3 ms per loop
%timeit df[~df[['Name1', 'Name2']].apply(frozenset, axis=1).duplicated()]
1 loop, best of 3: 3.72 s per loop
Upvotes: 28