Reputation: 1902
I have this data set with keys and their associated confidence values.
values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
import pandas as pd

df = pd.DataFrame(values, columns=['key', 'confidence'])
key confidence
0 S08 -6276.0
1 S01 -6360.0
2 S03 -6504.0
3 C01 -521682.0
4 C03 -556262.0
5 C08 -558108.0
6 S06 -1723974.0
7 S09 -2379806.0
8 C06 -2472398.0
9 C09 -2930688.0
In this case, the top 3 rows have very high confidence values and need to be selected. The remaining rows (from the fourth onward) have confidence values far below the top 3 and need to be discarded. The number of top rows to keep can vary dynamically from 1 to 9.
Upvotes: 2
Views: 1144
Reputation: 402263
Apply a threshold and filter with boolean indexing:
thresh = 0.0005 * df.confidence.std() # for example
df = df[df.confidence.diff().fillna(0).abs() < thresh]
df
key confidence
0 S08 -6276.0
1 S01 -6360.0
2 S03 -6504.0
To keep all rows, masking the discarded confidence values with NaN, use df.where:
df.confidence = df.confidence.where(df.confidence.diff().fillna(0).abs() < thresh)
df
key confidence
0 S08 -6276.0
1 S01 -6360.0
2 S03 -6504.0
3 C01 NaN
4 C03 NaN
5 C08 NaN
6 S06 NaN
7 S09 NaN
8 C06 NaN
9 C09 NaN
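If the masked rows should eventually be removed anyway, dropna recovers the filtered frame from the NaN version. A small self-contained sketch, reusing the same data and the same 0.0005 * std threshold heuristic from above:

```python
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])

thresh = 0.0005 * df.confidence.std()
# Mask the low-confidence rows with NaN instead of removing them...
df.confidence = df.confidence.where(df.confidence.diff().fillna(0).abs() < thresh)
# ...then drop them once the NaN placeholders are no longer needed.
kept = df.dropna(subset=['confidence'])
```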
Upvotes: 3
Reputation: 7915
Or use pandas.DataFrame.nlargest:
df = pd.DataFrame(values, columns=['key', 'confidence']).nlargest(3, 'confidence')
key confidence
0 S08 -6276.0
1 S01 -6360.0
2 S03 -6504.0
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.nlargest.html
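Since the question says the top-N count varies dynamically, the n passed to nlargest can itself be derived from the data, e.g. by locating the first gap between consecutive confidences that exceeds the threshold. This is a sketch combining the two answers; the 0.0005 * std threshold is the same heuristic as above, not something fixed by the question:

```python
import pandas as pd

values = [('S08', -6276.0), ('S01', -6360.0), ('S03', -6504.0), ('C01', -521682.0),
          ('C03', -556262.0), ('C08', -558108.0), ('S06', -1723974.0),
          ('S09', -2379806.0), ('C06', -2472398.0), ('C09', -2930688.0)]
df = pd.DataFrame(values, columns=['key', 'confidence'])

# Sort descending so each diff measures the gap to the next-best row.
s = df.sort_values('confidence', ascending=False).reset_index(drop=True)

thresh = 0.0005 * s.confidence.std()   # same heuristic threshold as above
gaps = s.confidence.diff().abs()       # gap to the previous (better) row
exceeds = gaps > thresh                # first True marks the cut-off
n = int(exceeds.idxmax()) if exceeds.any() else len(s)

top = s.nlargest(n, 'confidence')      # S08, S01, S03 for this data
```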
Upvotes: 2