Reputation: 35
In Pandas, I have a dataframe with ZipCode, Age, and a bunch of columns that should all have values 1 or 0, ie:
ZipCode Age A B C D
12345 21 0 1 1 1
12345 22 1 0 1 4
23456 45 1 0 1 1
23456 21 3 1 0 0
I want to delete all rows in which 0 or 1 doesn't appear in columns A,B,C, or D as a way to clean up the data. In this case, I would remove the 2nd and 4th row because 4 appears in column D in row 2 and 3 appears in column A in row 4. I want to do this even if I have 100 columns to check such that I don't have to look up every column one by one in my conditional statement. How would I do this?
Upvotes: 0
Views: 1275
Reputation: 31692
Other two solutions works well but if you interested in speed you should look at numpy in1d
function:
data=df.loc[:, 'A':]
In [72]: df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
Timing:
In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 us per loop
In [74]: %timeit df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 us per loop
In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop
Upvotes: 0
Reputation: 31181
You can opt for a vectorized solution:
In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
Out[64]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
Upvotes: 2