Reputation: 3204
I know how to delete rows based on simple criteria like in this stack overflow question, however, I need to delete rows using more complex criteria.
My situation: I have rows of data where each row has four columns containing numeric codes. I need to drop all rows that don't have at least one code with a leading digit of less than 5. I've currently got a function that I can use with dataframe.apply that creates a new column, 'keep', and populates it with 1 if it is a row to keep. I then do a second pass using that simple keep column to delete unwanted rows. What I'm looking for is a way to do this in a single pass without having to create a new column.
Example Data:
a | b | c | d
0 145|567|999|876
1 999|876|543|543
In that data I would like to keep the first row because in column 'a' the leading digit is less than 5. The second row has no columns with a leading digit of less than 5, so that row needs to be dropped.
Upvotes: 3
Views: 2510
Reputation: 394003
This should work:
In [31]:
df[(df.apply(lambda x: x.str[0].astype(int))).lt(5).any(axis=1)]
Out[31]:
a b c d
0 145 567 999 876
So basically this takes the first character of each column using the vectorised str
method, we cast this to an int, we then call lt
which is less than row-wise to produce a boolean df, we then call any
on the df row-wise to produce a boolean mask on the index which use to mask the df. So breaking the above down:
In [34]:
df.apply(lambda x: x.str[0].astype(int))
Out[34]:
a b c d
0 1 5 9 8
1 9 8 5 5
In [35]:
df.apply(lambda x: x.str[0].astype(int)).lt(5)
Out[35]:
a b c d
0 True False False False
1 False False False False
In [37]:
df.apply(lambda x: x.str[0].astype(int)).lt(5).any(axis=1)
Out[37]:
0 True
1 False
dtype: bool
EDIT
To handle NaN
values you add a call to dropna
:
In [39]:
t="""a,b,c,d
0,145,567,999,876
1,999,876,543,543
2,,324,344"""
df = pd.read_csv(io.StringIO(t),dtype=str)
df
Out[39]:
a b c d
0 145 567 999 876
1 999 876 543 543
2 NaN 324 344 NaN
In [44]:
df[(df.apply(lambda x: x.dropna().str[0].astype(int))).lt(5,axis=0).any(axis=1)]
Out[44]:
a b c d
0 145 567 999 876
2 NaN 324 344 NaN
Upvotes: 4