mevdiven

Reputation: 1902

Selecting the top N rows of a dataframe based on a threshold

I have this data set with keys and their associated confidence values.

import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])

   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0
3  C01   -521682.0
4  C03   -556262.0
5  C08   -558108.0
6  S06  -1723974.0
7  S09  -2379806.0
8  C06  -2472398.0
9  C09  -2930688.0

In this case, the top 3 rows have much higher confidence values than the rest and need to be selected. The remaining rows (from the fourth one onward) have confidence values that are very far from the top 3 and need to be discarded. The number of top rows is not fixed: it can vary from 1 to 9 depending on the data.

Upvotes: 2

Views: 1144

Answers (2)

cs95

Reputation: 402263

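Filter with boolean indexing: keep the rows whose confidence differs from the previous row's by less than a threshold. This relies on the rows already being sorted by confidence, as they are in your example.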

thresh = 0.0005 * df.confidence.std()  # for example
top = df[df.confidence.diff().fillna(0).abs() < thresh]  # df itself is left unchanged
top
   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0

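To keep all rows and replace the discarded confidence values with NaN instead, apply Series.where to the original, unfiltered df: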

df.confidence = df.confidence.where(df.confidence.diff().fillna(0).abs() < thresh)
df  
   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0
3  C01         NaN
4  C03         NaN
5  C08         NaN
6  S06         NaN
7  S09         NaN
8  C06         NaN
9  C09         NaN
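
One caveat with the plain diff filter: it keeps every row that sits close to its predecessor, so two near-equal values further down the list would slip through as well. Here is a minimal sketch of a variant that stops at the first large jump. The helper name select_leading_run and the demo frame are made up for illustration, and it assumes the frame is already sorted by confidence in descending order.

```python
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])


def select_leading_run(frame, thresh, column='confidence'):
    """Keep rows from the top until the first step of size >= thresh.

    Hypothetical helper; assumes `frame` is sorted by `column`, descending.
    """
    small_step = frame[column].diff().fillna(0).abs() < thresh
    # Count the large steps seen so far; rows before the first one have a count of 0.
    before_first_jump = (~small_step).cumsum() == 0
    return frame[before_first_jump]


thresh = 0.0005 * df['confidence'].std()  # same example threshold as above
top = select_leading_run(df, thresh)
print(top)

# Made-up data where the plain diff filter would also keep the last row.
demo = pd.DataFrame({'key': list('abcd'),
                     'confidence': [-10.0, -12.0, -500.0, -501.0]})
print(select_leading_run(demo, thresh=5.0))
```

On the question's data this selects the same three rows as the filter above. On the demo frame it keeps only a and b, whereas the plain filter would also keep d.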

Upvotes: 3

mwweb

Reputation: 7915

Alternatively, if you already know N, use pandas.DataFrame.nlargest:

df = pd.DataFrame(values, columns=['key', 'confidence']).nlargest(3, 'confidence')
df

   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.nlargest.html
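
Since N is not known in advance in the question, it has to be computed before calling nlargest. A rough sketch of one way to do that: sort the confidences, find the first gap between neighbours that exceeds a cutoff, and use the number of rows before that gap as N. The helper count_top_rows and the cutoff of 1000.0 are assumptions for illustration, not part of pandas; pick a cutoff that suits your data.

```python
import numpy as np
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])


def count_top_rows(confidence, cutoff):
    """Number of values before the first gap larger than `cutoff`.

    Hypothetical helper; sorts internally, so the input order does not matter.
    """
    ordered = confidence.sort_values(ascending=False).to_numpy()
    gaps = np.abs(np.diff(ordered))       # gaps[i] lies between rank i and rank i + 1
    big = np.flatnonzero(gaps > cutoff)   # positions of all large gaps
    # No large gap at all means every row belongs to the top group.
    return int(big[0]) + 1 if big.size else len(ordered)


n = count_top_rows(df['confidence'], cutoff=1000.0)  # assumed cutoff
top = df.nlargest(n, 'confidence')
print(n)
print(top)
```

With the question's data the first gap above the cutoff comes after the third value, so n is 3 and nlargest returns S08, S01 and S03.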

Upvotes: 2
