Reputation: 33
I have a pandas correlation matrix dataframe that has hundreds of columns and rows. I want to filter the whole dataframe so that I only get cells that are above a certain value, e.g. any value > 0.4. I'm not sure what the best way to do that is. I've seen multiple posts that filter by column name, but with hundreds of columns, how would you iterate through each column?
Upvotes: 0
Views: 3888
Reputation: 862581
If you want to filter only rows, use boolean indexing with a mask created by gt (>) together with any, which checks whether at least one value per row is True:
df[df.gt(.4).any(axis=1)]
Or, if you want to filter both rows and columns:
m = df.gt(.4)
df.loc[m.any(axis=1), m.any()]
If you want only the values that meet the condition, you can use where, which replaces all other values with NaN:
df.where(df.gt(.4))
Sample:
import numpy as np
import pandas as pd

np.random.seed(4137)
df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
df['A'] /= 10
print (df)
A B C
0 0.090262 0.313517 0.319292
1 0.086841 0.346788 0.314144
2 0.044069 0.764295 0.886659
3 0.089739 0.230103 0.594301
4 0.047101 0.132819 0.692846
df1 = df[df.gt(.4).any(axis=1)]
print (df1)
A B C
2 0.044069 0.764295 0.886659
3 0.089739 0.230103 0.594301
4 0.047101 0.132819 0.692846
m = df.gt(.4)
df2 = df.loc[m.any(axis=1), m.any()]
print (df2)
B C
2 0.764295 0.886659
3 0.230103 0.594301
4 0.132819 0.692846
df3 = df.where(df.gt(.4))
print (df3)
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN 0.764295 0.886659
3 NaN NaN 0.594301
4 NaN NaN 0.692846
Detail:
print (m)
A B C
0 False False False
1 False False False
2 False True True
3 False False True
4 False False True
print (m.any(axis=1))
0 False
1 False
2 True
3 True
4 True
dtype: bool
print (m.any())
A False
B True
C True
dtype: bool
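As a side note for the correlation-matrix case in the question: the diagonal of a correlation matrix is always 1.0, so for any threshold below 1 the row/column filter above keeps every row and column, and where is the step that actually blanks out the uninteresting cells. A minimal sketch, assuming an illustrative DataFrame and the 0.4 threshold (the column names and data here are made up, not from the question):
import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))
corr = data.corr()

# mask of cells above the threshold
m = corr.gt(.4)

# keep only correlations above the threshold, everything else becomes NaN
filtered = corr.where(m)
print(filtered)
This scales to hundreds of columns without any explicit iteration over column names.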
Upvotes: 2