Reputation: 348
I feel like the answer to my question may be obvious, but I can't quite figure it out. I want to know the best way (or any good way), in Python, to threshold a numerical variable so that the average of the values above the threshold equals a given number. (In my case the values of interest are above the threshold, but they could just as easily be below it.) I would be happy with any reasonably efficient solution using numpy or pandas.
Start with a pandas series (or a 1D numpy array) such as:
[0.1, 0.2, 0.3, 0.4, 0.5]
(in practice, the series or array may be very long). Suppose for instance that the given number, which is the target average, is 0.35. In this case, we can eyeball that the desired threshold must be any number greater than or equal to 0.1 but less than 0.2, since the average of 0.2, 0.3, 0.4, and 0.5 (all above the threshold) is equal to 0.35. (In particular, the answer isn't unique.)
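To make the toy case concrete, here is a minimal sketch that checks it with a boolean mask. The names `values`, `target`, and `threshold` are only illustrative, and 0.15 is just one arbitrary pick from the interval of valid thresholds:

```python
import numpy as np

values = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
target = 0.35
threshold = 0.15  # illustrative: any value in [0.1, 0.2) selects the same elements

# Mean of the values strictly above the threshold
mean_above = values[values > threshold].mean()

# Compare with a tolerance, since floating-point sums are rarely exact
print(mean_above, np.isclose(mean_above, target))
```

What I am after is the reverse of this check: given `target`, recover a suitable `threshold`.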
Also, unlike the toy example above, in some cases it may be impossible to exactly match the given number. But I still want to solve for a threshold such that the average of all values above that threshold is as close to the given number as possible.
Any advice on how to accomplish this in Python is greatly appreciated. In particular, if there exists a numpy or pandas method that does this, please let me know. And if my question requires further clarification, please let me know. Thank you!
Upvotes: 4
Views: 1099
Reputation: 11171
You can sort the array and then compute, for each element taken as the threshold, the mean of all values at or above it:
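import numpy as np
import pandas as pd
# example data, sorted in ascending order
x = np.sort(np.random.random(20))
# n[k] = number of elements included in the k-th running mean
n = np.arange(1, len(x) + 1, 1)
# cumulative sum of x in reverse order / num elements gives threshold means:
# threshold_means[k] is the mean of the k + 1 largest values
threshold_means = np.cumsum(x[::-1])/n
# pair each candidate threshold with the mean of the values at or above it
df = pd.DataFrame(dict(threshold=x[::-1], threshold_means=threshold_means))
df = df.sort_values("threshold").reset_index(drop=True)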
Output:
    threshold  threshold_means
0    0.036453         0.474160
1    0.057774         0.497197
2    0.060959         0.521609
3    0.095344         0.548706
4    0.218508         0.577042
5    0.229380         0.600944
6    0.281243         0.627484
7    0.298807         0.654118
8    0.340491         0.683727
9    0.374211         0.714931
10   0.514332         0.749003
11   0.554557         0.775077
12   0.590041         0.802642
13   0.672917         0.833014
14   0.788553         0.859697
15   0.800751         0.873925
16   0.863758         0.892219
17   0.870211         0.901706
18   0.874873         0.917453
19   0.960032         0.960032
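This is quite performant; it takes less than a second for len(x) = 1 million. If you had billions of values, you could use a binary search or similar instead, since the threshold means are monotonic: raising the threshold only drops the smallest remaining values, so the mean never decreases.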
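To get from the cumulative means to an actual answer, you still have to pick the candidate whose mean is closest to the target. Here is a minimal sketch of that last step. The function name `closest_threshold` is made up for illustration, and it assumes the threshold is inclusive (values greater than or equal to it are averaged), which matches the computation above rather than a strict "above":

```python
import numpy as np

def closest_threshold(values, target):
    """Return (threshold, mean) such that the mean of all values >= threshold
    is as close as possible to target. Hypothetical helper, not a library call."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    if x.size == 0:
        raise ValueError("values must not be empty")
    # means[k] is the mean of the k + 1 largest values
    means = np.cumsum(x) / np.arange(1, x.size + 1)
    # With duplicates, only the last occurrence of each distinct value
    # corresponds to a real cut "values >= threshold"; keep only those.
    last = np.r_[x[1:] != x[:-1], True]
    thresholds, candidates = x[last], means[last]
    # Pick the candidate whose mean is nearest to the target
    k = np.argmin(np.abs(candidates - target))
    return thresholds[k], candidates[k]

t, m = closest_threshold([0.1, 0.2, 0.3, 0.4, 0.5], 0.35)
print(t, m)
```

For the toy data in the question this picks 0.2 as the inclusive threshold, which is equivalent to any strict threshold in [0.1, 0.2). The sort dominates the cost; the `argmin` is a linear scan, and since the candidate means are monotonic you could swap it for a bisection (for example with `np.searchsorted` on the reversed array) if that ever matters.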
Upvotes: 2