TX412
TX412

Reputation: 33

Filtering row values in pandas by threshold value

I have a pandas correlation matrix dataframe that has hundreds of columns and rows. I want to filter the whole dataframe so that i only get cells that are above a certain value, any row value > .4, for example. I'm not sure what is the best way to do that. I've seen multiple posts that filter by column name, but with hundreds of columns, how would you iterate through each column?

Upvotes: 0

Views: 3888

Answers (1)

jezrael
jezrael

Reputation: 862581

If want filter only rows use boolean indexing with mask created by gt > with any for check at least one True:

df[df.gt(.4).any(axis=1)]

Or if want filter rows and columns:

m = df.gt(.4)
df.loc[m.any(axis=1), m.any()]

If want only values by condition is possible use where for replace another values to NaNs:

df.where(df.gt(.4))

Sample:

np.random.seed(4137)
df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
df['A'] /=  10
print (df)
          A         B         C
0  0.090262  0.313517  0.319292
1  0.086841  0.346788  0.314144
2  0.044069  0.764295  0.886659
3  0.089739  0.230103  0.594301
4  0.047101  0.132819  0.692846

df1 = df[df.gt(.4).any(axis=1)]
print (df1)
          A         B         C
2  0.044069  0.764295  0.886659
3  0.089739  0.230103  0.594301
4  0.047101  0.132819  0.692846

m = df.gt(.4)
df2 = df.loc[m.any(axis=1), m.any()]
print (df2)
          B         C
2  0.764295  0.886659
3  0.230103  0.594301
4  0.132819  0.692846

df3 = df.where(df.gt(.4))
print (df3)
    A         B         C
0 NaN       NaN       NaN
1 NaN       NaN       NaN
2 NaN  0.764295  0.886659
3 NaN       NaN  0.594301
4 NaN       NaN  0.692846

Detail:

print (m)
       A      B      C
0  False  False  False
1  False  False  False
2  False   True   True
3  False  False   True
4  False  False   True

print (m.any(axis=1))
0    False
1    False
2     True
3     True
4     True
dtype: bool

print (m.any())
A    False
B     True
C     True
dtype: bool

Upvotes: 2

Related Questions