Reputation: 21961
I can use pandas
dropna()
functionality to remove rows with some or all columns set as NA
's. Is there an equivalent function for dropping rows with all columns having value 0?
P kt b tt mky depth
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 1.1 3 4.5 2.3 9.0
In this example, we would like to drop the first 4 rows from the data frame.
thanks!
Upvotes: 193
Views: 484786
Reputation: 1
from io import StringIO
import pandas as pd
s = '''
P kt b tt mky depth
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 1.1 3 4.5 2.3 9.0
'''
df = pd.read_csv(StringIO(s), sep=r'\s+', engine='python',index_col=0)
print(df)
print()
print(
df.where(df != 0).dropna(how='all')
)
Upvotes: 0
Reputation: 1288
Accessing only the indices that are TRUE for the row sum > 0 should suffice:
ndf=df[df.sum(axis=1)>0]
Upvotes: 0
Reputation: 763
Following the example in the accepted answer, a more elegant solution:
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
df = df[df.any(axis=1)]
print(df)
a b
1 0 1
2 1 0
3 1 1
Upvotes: 5
Reputation: 21
For me this code: df.loc[(df!=0).any(axis=0)]
did not work. It returned the exact dataset.
Instead, I used df.loc[:, (df!=0).any(axis=0)]
and dropped all the columns with 0 values in the dataset
The function .all()
droped all the columns in which are any zero values in my dataset.
Upvotes: 1
Reputation: 307
Replace the zeros with nan
and then drop the rows with all entries as nan
.
After that replace nan
with zeros.
import numpy as np
df = df.replace(0, np.nan)
df = df.dropna(how='all', axis=0)
df = df.replace(np.nan, 0)
Upvotes: 29
Reputation: 314
To drop all columns with values 0 in any row:
new_df = df[df.loc[:]!=0].dropna()
Upvotes: -2
Reputation: 46
df = df [~( df [ ['kt' 'b' 'tt' 'mky' 'depth', ] ] == 0).all(axis=1) ]
Try this command its perfectly working.
Upvotes: 0
Reputation: 857
Another alternative:
# Is there anything in this row non-zero?
# df != 0 --> which entries are non-zero? T/F
# (df != 0).any(axis=1) --> are there 'any' entries non-zero row-wise? T/F of rows that return true to this statement.
# df.loc[all_zero_mask,:] --> mask your rows to only show the rows which contained a non-zero entry.
# df.shape to confirm a subset.
all_zero_mask=(df != 0).any(axis=1) # Is there anything in this row non-zero?
df.loc[all_zero_mask,:].shape
Upvotes: 3
Reputation: 1366
Couple of solutions I found to be helpful while looking this up, especially for larger data sets:
df[(df.sum(axis=1) != 0)] # 30% faster
df[df.values.sum(axis=1) != 0] # 3X faster
Continuing with the example from @U2EF1:
In [88]: df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
In [91]: %timeit df[(df.T != 0).any()]
1000 loops, best of 3: 686 µs per loop
In [92]: df[(df.sum(axis=1) != 0)]
Out[92]:
a b
1 0 1
2 1 0
3 1 1
In [95]: %timeit df[(df.sum(axis=1) != 0)]
1000 loops, best of 3: 495 µs per loop
In [96]: %timeit df[df.values.sum(axis=1) != 0]
1000 loops, best of 3: 217 µs per loop
On a larger dataset:
In [119]: bdf = pd.DataFrame(np.random.randint(0,2,size=(10000,4)))
In [120]: %timeit bdf[(bdf.T != 0).any()]
1000 loops, best of 3: 1.63 ms per loop
In [121]: %timeit bdf[(bdf.sum(axis=1) != 0)]
1000 loops, best of 3: 1.09 ms per loop
In [122]: %timeit bdf[bdf.values.sum(axis=1) != 0]
1000 loops, best of 3: 517 µs per loop
Upvotes: 12
Reputation: 31898
I look up this question about once a month and always have to dig out the best answer from the comments:
df.loc[(df!=0).any(1)]
Thanks Dan Allan!
Upvotes: 35
Reputation: 13768
One-liner. No transpose needed:
df.loc[~(df==0).all(axis=1)]
And for those who like symmetry, this also works...
df.loc[(df!=0).any(axis=1)]
Upvotes: 230
Reputation: 13251
It turns out this can be nicely expressed in a vectorized fashion:
> df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
> df = df[(df.T != 0).any()]
> df
a b
1 0 1
2 1 0
3 1 1
Upvotes: 149
Reputation: 86128
import pandas as pd
df = pd.DataFrame({'a' : [0,0,1], 'b' : [0,0,-1]})
temp = df.abs().sum(axis=1) == 0
df = df.drop(temp)
Result:
>>> df
a b
2 1 -1
Upvotes: 5
Reputation: 13768
You can use a quick lambda
function to check if all the values in a given row are 0
. Then you can use the result of applying that lambda
as a way to choose only the rows that match or don't match that condition:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3),
index=['one', 'two', 'three', 'four', 'five'],
columns=list('abc'))
df.loc[['one', 'three']] = 0
print df
print df.loc[~df.apply(lambda row: (row==0).all(), axis=1)]
Yields:
a b c
one 0.000000 0.000000 0.000000
two 2.240893 1.867558 -0.977278
three 0.000000 0.000000 0.000000
four 0.410599 0.144044 1.454274
five 0.761038 0.121675 0.443863
[5 rows x 3 columns]
a b c
two 2.240893 1.867558 -0.977278
four 0.410599 0.144044 1.454274
five 0.761038 0.121675 0.443863
[3 rows x 3 columns]
Upvotes: 7