Reputation: 1902

Selecting the top N rows of a dataframe based on a threshold

I have this data set with keys and their associated confidence values.

import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])

   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0
3  C01   -521682.0
4  C03   -556262.0
5  C08   -558108.0
6  S06  -1723974.0
7  S09  -2379806.0
8  C06  -2472398.0
9  C09  -2930688.0

In this case, the top 3 rows have very high confidence values and need to be selected. The remaining rows (from the fourth onward) have confidence values that are very far from the top 3 and need to be discarded. The number of top rows (N) can vary dynamically from 1 to 9.

Upvotes: 2

Answers (2)

cs95

Reputation: 403128

Filter with boolean indexing, keeping only the rows whose gap to the previous row is below a threshold.

thresh = 0.0005 * df.confidence.std() # for example 
df = df[df.confidence.diff().fillna(0).abs() < thresh]
df
   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0

To keep every row and mask the discarded confidence values as NaN instead, use Series.where on the original (unfiltered) df:

df.confidence = df.confidence.where(df.confidence.diff().fillna(0).abs() < thresh)
df  
   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0
3  C01         NaN
4  C03         NaN
5  C08         NaN
6  S06         NaN
7  S09         NaN
8  C06         NaN
9  C09         NaN
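
One caveat with the element-wise filter: it tests each gap on its own, so rows further down that happen to sit close to their predecessor could slip through. A minimal sketch that cuts at the first large gap instead, assuming the same example threshold as above (the names ranked, gaps, too_far and n are only illustrative, not part of pandas):

```python
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])

# Rank rows so the highest confidence comes first (already true for this data).
ranked = df.sort_values('confidence', ascending=False).reset_index(drop=True)

# Absolute gap between each row and the one above it; the first row has no
# predecessor, so its gap is treated as 0.
gaps = ranked['confidence'].diff().abs().fillna(0).to_numpy()

# Assumption: the same example threshold used earlier in this answer.
thresh = 0.0005 * ranked['confidence'].std()
too_far = gaps >= thresh

# Position of the first large gap; keep every row if no gap exceeds the threshold.
n = int(too_far.argmax()) if too_far.any() else len(ranked)
top = ranked.iloc[:n]
```

Here N is derived from the data rather than fixed in advance, so the same code should work whether 1 or 9 rows belong to the top group, as long as the threshold suits the data.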

Upvotes: 3

mwweb

Reputation: 7925

Alternatively, if N is already known, use pandas.DataFrame.nlargest:

df = pd.DataFrame(values, columns=['key', 'confidence']).nlargest(3, 'confidence')


   key  confidence
0  S08     -6276.0
1  S01     -6360.0
2  S03     -6504.0

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.nlargest.html
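
Since N varies in the question, it can be passed in as a variable rather than hard-coded. A small sketch, where top_n is a hypothetical helper name and the rows are shuffled only to illustrate that nlargest does not need pre-sorted input:

```python
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]

# Shuffle the rows; nlargest picks the top values regardless of input order.
df = pd.DataFrame(values, columns=['key', 'confidence']).sample(frac=1, random_state=0)

def top_n(frame, n, column='confidence'):
    """Return the n rows with the largest values in `column`, best first."""
    return frame.nlargest(n, column)

top3 = top_n(df, 3)
top1 = top_n(df, 1)
```

This still requires N to be determined some other way, for example from a threshold on the gaps between consecutive confidence values.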

Upvotes: 2
