Reputation: 51
I would like to filter out pandas DataFrame columns based on a condition that a predicate function evaluates on each column. For example (in general the predicate may be much more sophisticated, with complex dependencies between different elements of the series):
def detect_jumps(data, jump_factor=5):
    for i in range(1, len(data)):
        if data[i] - data[i - 1] > jump_factor:
            return True
    return False
on a dataframe df:
import pandas as pd
data = [
    {'A': '10', 'B': '10', 'C': '100', 'D': '100', 'E': '0'},
    {'A': '15', 'B': '16', 'C': '105', 'D': '104', 'E': '10'},
    {'A': '20', 'B': '20', 'C': '110', 'D': '110', 'E': '11'},
]
df = pd.DataFrame(data)
i.e.
    A   B    C    D   E
0  10  10  100  100   0
1  15  16  105  104  10
2  20  20  110  110  11
With the default jump_factor of 5, it should filter out columns B (col[1] - col[0] == 6 > 5), D (col[2] - col[1] == 6 > 5) and E (col[1] - col[0] == 10 > 5).
With the predicate detect_jumps(data, 9),
it should only filter out column E (col[1] - col[0] == 10 > 9).
Are there any ways to use such functions as a condition for filtering?
Upvotes: 0
Views: 473
Reputation: 261820
You don't need a custom function; use vectorized operations:
df2 = df.loc[:, ~df.astype(int).diff().gt(5).any()]
output:
    A    C
0  10  100
1  15  105
2  20  110
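To make the one-liner easier to follow, and to cover the jump_factor of 9 from the question, here is a minimal sketch that splits it into steps. The helper name drop_jumpy_columns is hypothetical (it is not part of pandas), and the sketch assumes every column can be converted to int.

```python
import pandas as pd

df = pd.DataFrame([
    {'A': '10', 'B': '10', 'C': '100', 'D': '100', 'E': '0'},
    {'A': '15', 'B': '16', 'C': '105', 'D': '104', 'E': '10'},
    {'A': '20', 'B': '20', 'C': '110', 'D': '110', 'E': '11'},
])

def drop_jumpy_columns(frame, jump_factor=5):
    """Hypothetical helper: keep columns with no row-to-row increase above jump_factor."""
    # Differences between consecutive rows; the first row becomes NaN.
    diffs = frame.astype(int).diff()
    # One boolean per column: True if any difference exceeds the threshold.
    # NaN compares as False, so the first row never triggers a jump.
    has_jump = diffs.gt(jump_factor).any()
    # Select the columns without a jump (values keep their original dtype).
    return frame.loc[:, ~has_jump]

kept_5 = list(drop_jumpy_columns(df, 5).columns)
kept_9 = list(drop_jumpy_columns(df, 9).columns)
print(kept_5)
print(kept_9)
```

With a threshold of 5 this keeps A and C; with 9 only E is dropped.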
Nevertheless, using your function:
df2 = df.loc[:, [not detect_jumps(c) for label, c in df.astype(int).items()]]
# OR
df2 = df[[label for label, c in df.astype(int).items() if not detect_jumps(c)]]
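Another option, if the predicate has to stay a plain Python function, is DataFrame.apply, which calls the function once per column and forwards extra keyword arguments to it. The sketch below reuses detect_jumps and the data from the question; it relies on the default integer index, since the function indexes with data[i].

```python
import pandas as pd

def detect_jumps(data, jump_factor=5):
    # Predicate from the question: True if any consecutive increase exceeds jump_factor.
    for i in range(1, len(data)):
        if data[i] - data[i - 1] > jump_factor:
            return True
    return False

df = pd.DataFrame([
    {'A': '10', 'B': '10', 'C': '100', 'D': '100', 'E': '0'},
    {'A': '15', 'B': '16', 'C': '105', 'D': '104', 'E': '10'},
    {'A': '20', 'B': '20', 'C': '110', 'D': '110', 'E': '11'},
])

# apply() passes each column as a Series and forwards jump_factor to detect_jumps,
# giving a boolean Series indexed by column label.
mask = df.astype(int).apply(detect_jumps, jump_factor=9)

# Keep only the columns for which the predicate is False.
df2 = df.loc[:, ~mask]
print(list(df2.columns))
```

This is slower than the vectorized version on wide frames, because the predicate runs as a Python loop for every column, but it works for predicates that cannot be expressed with diff() and comparisons.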
Upvotes: 2