Reputation: 466
I have a pandas DataFrame and I want to drop every row in which all columns hold exactly the same value. Here's an example to illustrate the idea.
Input:
index A B C D E F ....
0 1 2 3 1 3 4
1 2 2 2 2 2 2
2 5 5 5 5 5 5
3 7 7 6 7 7 7
Output:
index A B C D E F ....
0 1 2 3 1 3 4
3 7 7 6 7 7 7
There can be many columns here.
Upvotes: 5
Views: 2086
Reputation:
An efficient way of doing this with numeric DataFrames is to use the standard deviation (which will be 0 only if all values are the same):
df[df.std(axis=1) > 0]
Out:
A B C D E F
0 1 2 3 1 3 4
3 7 7 6 7 7 7
As tgrandje points out, due to floating point inaccuracy the standard deviation may not be exactly zero. You can instead use np.isclose for a more robust approach:
df[~np.isclose(df.std(axis=1), 0)]
which results in the same answer.
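For reference, here is a minimal self-contained sketch of both variants, run on the sample data from the question. The variable names (row_std, exact, tolerant) are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame(
    {
        "A": [1, 2, 5, 7],
        "B": [2, 2, 5, 7],
        "C": [3, 2, 5, 6],
        "D": [1, 2, 5, 7],
        "E": [3, 2, 5, 7],
        "F": [4, 2, 5, 7],
    }
)

# Row-wise standard deviation: zero when every value in the row is identical.
row_std = df.std(axis=1)

# Exact comparison, as in the first snippet.
exact = df[row_std > 0]

# Tolerant comparison: treat a standard deviation "close to 0" as constant.
tolerant = df[~np.isclose(row_std, 0)]

print(tolerant)
```

Note that np.isclose uses an absolute tolerance of 1e-8 by default, so a row whose values genuinely differ by less than that would also be dropped. If your data lives on such a small scale, pass an explicit atol.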
Timings with 40k rows:
%timeit df[df.std(axis=1) > 0]
1000 loops, best of 3: 1.69 ms per loop
%timeit df[df.nunique(1) > 1]
1 loop, best of 3: 2.62 s per loop
Upvotes: 12
Reputation: 323266
Using nunique:
df=df[df.nunique(1)>1]
df
Out[286]:
A B C D E F
index
0 1 2 3 1 3 4
3 7 7 6 7 7 7
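Since nunique counts distinct values rather than doing arithmetic, the same filter should also work when the columns are not numeric. A small sketch with made-up string data (the frame below is an assumption for illustration, not from the question):

```python
import pandas as pd

# Hypothetical non-numeric frame; row 0 holds the same value in every column.
df = pd.DataFrame(
    {
        "A": ["x", "y", "z"],
        "B": ["x", "y", "q"],
        "C": ["x", "w", "z"],
    }
)

# Count distinct values per row and keep rows with more than one.
kept = df[df.nunique(axis=1) > 1]

print(kept)
```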
Upvotes: 5
Reputation: 210842
Yet another efficient way (though not as fast as @ayhan's solution):
In [17]: df[~df.eq(df.iloc[:, 0], axis=0).all(1)]
Out[17]:
A B C D E F
index
0 1 2 3 1 3 4
3 7 7 6 7 7 7
Timings for a DataFrame with 40,000 rows:
In [19]: df.shape
Out[19]: (40000, 6)
In [20]: %timeit df[df.std(axis=1) > 0]
5.62 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: %timeit df[df.nunique(1)>1]
9.87 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [23]: %timeit df[~df.eq(df.iloc[:, 0], axis=0).all(1)]
13 ms ± 86.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
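To make the one-liner easier to follow, here is the same comparison against the first column split into steps, using the sample data from the question. The intermediate names (first_col, matches_first, is_constant) are only illustrative.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame(
    [
        [1, 2, 3, 1, 3, 4],
        [2, 2, 2, 2, 2, 2],
        [5, 5, 5, 5, 5, 5],
        [7, 7, 6, 7, 7, 7],
    ],
    columns=list("ABCDEF"),
)

# Take the first column as the reference value for each row.
first_col = df.iloc[:, 0]

# Compare every column to the reference, aligning on the row index.
matches_first = df.eq(first_col, axis=0)

# A row is constant when every column equals the first one.
is_constant = matches_first.all(axis=1)

# Keep only the rows that are not constant.
result = df[~is_constant]

print(result)
```

One caveat: NaN never compares equal to NaN, so a row made up entirely of NaN values would not be treated as constant by this approach.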
Upvotes: 3