Reputation: 39
Let say I have pandas DataFrame like this:
NO_DOC | NO_SEQ | DESC
A1 | 1 | A
B1 | 2 | B
C1 | 3 | C
I want to search row index from data above on a certain filter: I am able to find it using statement:
idx = data.index[data["NO_DOC"] == "A1" && data["NO_SEQ"] == 1]
but what I want to ask is how to make the "condition" is variable because I want to make it in a function.
Example:
def find(self, data, columns, value):
return data.index[data[...] == value[...] && ....].tolist()
columns parameters is ["NO_DOC","NO_SEQ"]
value is ["A1", 1]
the filter condition above is: data["NO_DOC"] == "A1" && data["NO_SEQ"] == 1
How do I make a function to find row index on pandas DataFrame with a certain filter condition?
Upvotes: 1
Views: 1756
Reputation: 35686
We can use zip
with a list comprehension to build a DataFrame from each Series of boolean values:
mask_df = pd.DataFrame([
data[col] == val for col, val in zip(columns, values)
])
0 1 2
NO_DOC True False False
NO_SEQ True False False
Then we can reduce this DataFrame of booleans with all
to check which columns (indexes) have only True values (logical AND):
mask = mask_df.all()
0 True
1 False
2 False
dtype: bool
Note: logical OR can be achieved with any
instead of all
Now we can use the mask
to filter the indexes:
data.index[mask].tolist()
[0]
Together find
can look something like:
def find(data, columns, values):
# Create a DataFrame of boolean conditions
# Take logical AND using DataFrame.all
mask = pd.DataFrame([
data[col] == val for col, val in zip(columns, values)
]).all()
# Filter and convert to list
return data.index[mask].tolist()
Or with np.logical_and and .reduce which behaves in with the same logic as the DataFrame but can be much faster as it does not need to maintain pandas objects:
def find(data, columns, values):
mask = np.logical_and.reduce(
[data[col] == val for col, val in zip(columns, values)]
)
return data.index[mask].tolist()
Some timing via timeit.
Setup:
import numpy as np
import pandas as pd
def find_with_df_mask(data, columns, values):
mask = pd.DataFrame([
data[col] == val for col, val in zip(columns, values)
]).all()
return data.index[mask].tolist()
def find_with_np_mask(data, columns, values):
mask = np.logical_and.reduce(
[data[col] == val for col, val in zip(columns, values)]
)
return data.index[mask].tolist()
df = pd.concat([pd.DataFrame({
'NO_DOC': ['A1', 'B1', 'C1'],
'NO_SEQ': [1, 2, 3],
'DESC': ['A', 'B', 'C']
})] * 1000, ignore_index=True)
cols = ["NO_DOC", "NO_SEQ"]
vals = ["A1", 1]
Timings:
%timeit find_with_df_mask(df, cols, vals)
21.9 ms ± 65.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit find_with_np_mask(df, cols, vals)
319 µs ± 253 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2
Reputation: 11
To write Pandas dataframe filter condition from parameters in a function:
import pandas as pd
raw_data = {'NO_DOC':['A1','B1','C1'],'NO_SEQ':[1,2,3],'DESC':['A','B','C']}
data = pd.DataFrame(raw_data)
columns = ["NO_DOC", "NO_SEQ"]
value = ["A1", 1]
Create masks and send them to functions instead of columns and values:
mask1 = data[columns[0]] == value[0]
mask2 = data[columns[1]] == value[1]
def find(data, mask1,mask2):
return data.index[mask1 & mask2].tolist()
Call your function as follows:
find(data, mask1, mask2)
Upvotes: 1