Reputation: 1007
Let's say I have a pandas DataFrame df with columns A, B, C, D, E, F, G, H, and I want to filter it using a function functn that takes a "row" and returns True or False based on whether the row fulfills certain conditions (say the function uses every column except H). Is there a way to efficiently filter this DataFrame without a long and ugly lambda? The solution I have so far looks like this:
df = df[df.apply(functn, axis=1)]
but this method seems to be VERY slow, even for a frame with only 15k rows. Is there a clean and efficient way to filter a pandas DataFrame using a user-defined Python function instead of a lambda or query?
note: I previously implemented this with plain Python 2D arrays and it was MUCH faster than pandas. Am I misusing a feature, or is there a way to make this filtering faster that I'm not aware of?
edit:
The data is structured roughly like this:
# A B C D E F G H
[
[string1, string2, int1, int2, int3, int4, float1, float2],
...
]
The function does something like this:
def filter(row):
    # keep the row if G is small and at least one ratio clears 0.66
    var1 = row.G <= 0.01
    partial_a = (row.D - row.C + 1) * 1.0 / global_map[row.A]
    partial_b = (row.F - row.E + 1) * 1.0 / global_map[row.B]
    partial = partial_a >= 0.66 or partial_b >= 0.66
    return var1 and partial
The non-pandas implementation took the same data as a 2D array, looped over each row, applied the function to it (with a list as the argument instead of a "row"), and appended the row to a new list whenever the function returned True. A sketch of that version follows.
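Roughly like this (the names filter_row and rows are illustrative, and global_map is assumed to be a plain dict keyed by the A/B values):

# Illustrative sketch of the list-based version; `rows` is the 2D list
# with columns ordered A..H, `global_map` an assumed dict.
def filter_row(row):
    a, b, c, d, e, f, g, _h = row
    if g > 0.01:
        return False
    partial_a = (d - c + 1) * 1.0 / global_map[a]
    partial_b = (f - e + 1) * 1.0 / global_map[b]
    return partial_a >= 0.66 or partial_b >= 0.66

filtered = [row for row in rows if filter_row(row)]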
Upvotes: 2
Views: 6498
Reputation: 153460
IIUC, you don't need a function. Let's use boolean indexing as follows:
cond1 = df['G'] <= 0.01
# Series.map looks each A/B value up in global_map; plain dict
# indexing (global_map[df.A]) would fail on a whole Series.
cond2 = (df['D'] - df['C'] + 1) * 1.0 / df['A'].map(global_map) >= 0.66
cond3 = (df['F'] - df['E'] + 1) * 1.0 / df['B'].map(global_map) >= 0.66
mask = cond1 & (cond2 | cond3)
df[mask]
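For reference, a minimal self-contained sketch of the same approach, assuming global_map is a dict keyed by the values in columns A and B; the data here is made up purely to exercise the mask:

import pandas as pd

# Hypothetical lookup table and data, for illustration only
global_map = {'x': 100, 'y': 200}
df = pd.DataFrame({
    'A': ['x', 'y'], 'B': ['y', 'x'],
    'C': [1, 5],     'D': [80, 10],
    'E': [2, 3],     'F': [190, 9],
    'G': [0.005, 0.02], 'H': [0.1, 0.2],
})

cond1 = df['G'] <= 0.01
cond2 = (df['D'] - df['C'] + 1) * 1.0 / df['A'].map(global_map) >= 0.66
cond3 = (df['F'] - df['E'] + 1) * 1.0 / df['B'].map(global_map) >= 0.66
print(df[cond1 & (cond2 | cond3)])  # keeps only the first row

Because every condition is computed as a vectorized column operation rather than a per-row Python call, this should be far faster than apply(..., axis=1) on 15k rows.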
Upvotes: 4