Gregory Arenius
Gregory Arenius

Reputation: 3204

How do I drop rows from a Pandas dataframe based on data in multiple columns?

I know how to delete rows based on simple criteria like in this stack overflow question, however, I need to delete rows using more complex criteria.

My situation: I have rows of data where each row has four columns containing numeric codes. I need to drop all rows that don't have at least one code with a leading digit of less than 5. I've currently got a function that I can use with dataframe.apply that creates a new column, 'keep', and populates it with 1 if it is a row to keep. I then do a second pass using that simple keep column to delete unwanted rows. What I'm looking for is a way to do this in a single pass without having to create a new column.

Example Data:

   a | b | c | d
0 145|567|999|876
1 999|876|543|543

In that data I would like to keep the first row because in column 'a' the leading digit is less than 5. The second row has no columns with a leading digit of less than 5, so that row needs to be dropped.

Upvotes: 3

Views: 2510

Answers (1)

EdChum
EdChum

Reputation: 394003

This should work:

In [31]:
df[(df.apply(lambda x: x.str[0].astype(int))).lt(5).any(axis=1)]

Out[31]:
     a    b    c    d
0  145  567  999  876

So basically this takes the first character of each column using the vectorised str method, we cast this to an int, we then call lt which is less than row-wise to produce a boolean df, we then call any on the df row-wise to produce a boolean mask on the index which use to mask the df. So breaking the above down:

In [34]:
df.apply(lambda x: x.str[0].astype(int))

Out[34]:
   a  b  c  d
0  1  5  9  8
1  9  8  5  5

In [35]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5)

Out[35]:
       a      b      c      d
0   True  False  False  False
1  False  False  False  False

In [37]:    
df.apply(lambda x: x.str[0].astype(int)).lt(5).any(axis=1)

Out[37]:
0     True
1    False
dtype: bool

EDIT

To handle NaN values you add a call to dropna:

In [39]:
t="""a,b,c,d
0,145,567,999,876
1,999,876,543,543
2,,324,344"""
df = pd.read_csv(io.StringIO(t),dtype=str)
df

Out[39]:
     a    b    c    d
0  145  567  999  876
1  999  876  543  543
2  NaN  324  344  NaN

In [44]:
df[(df.apply(lambda x: x.dropna().str[0].astype(int))).lt(5,axis=0).any(axis=1)]

Out[44]:
     a    b    c    d
0  145  567  999  876
2  NaN  324  344  NaN

Upvotes: 4

Related Questions