Reputation: 61
Given:
Write a function kmax(x, k) that returns the Mean and Standard Deviation of the kth highest number of infinite random samples of size x. Assuming replacement.
Kicker: needs to be stable with very high numbers. For some rough benchmarks:
Preferably linear time complexity
If relevant, for my purposes all items in "values" are positive floats ranging from 0.000000 to 400.000000
I have working code, however it breaks down at higher numbers. Specifically, when k is a high percentage of x. I also have working code for the case where k=1 that is quite effective. I am having trouble making it robust to run as components get infinitely small and brick the program. Surely I am missing a very simple solution.
values = np.sort(merged_df["INPUT_NUMBERS"].values) # Sorted for ECDF
n = len(values)
def empirical_ecdf(values):
# empirical cumulative distribution function (ECDF)
return np.arange(1, n + 1) / n
import numpy as np
import scipy.special as sp
def kmax(x, k):
n = len(values)
ecdf = empirical_ecdf(values)
pmf_k = np.zeros(n)
for i in range(n):
if i == n - 1:
pmf_k[i] = sp.comb(x, k) * ((1 - ecdf[i-1]) ** k) * (ecdf[i-1] ** (x - k))
else:
pmf_k[i] = sp.comb(x, k) * ((1 - ecdf[i]) ** k - (1 - ecdf[i+1]) ** k) * (ecdf[i] ** (x - k))
pmf_k = pmf_k / np.sum(pmf_k)
mean_kth = np.sum(values * pmf_k)
mean_kth_sq = np.sum(values**2 * pmf_k)
std_kth = np.sqrt(mean_kth_sq - mean_kth**2)
return mean_kth, std_kth
Upvotes: 5
Views: 107