Mean and Standard Devation of Kth Highest Number Given Infinite Samples of Size x

Question

Given:

A list of numbers, "values".
A sample size, x
A kth highest number (ex: [3, 2, 1, 0], 2 is the 2nd highest number)

Write a function kmax(x, k) that returns the Mean and Standard Deviation of the kth highest number of infinite random samples of size x. Assuming replacement.

Kicker: needs to be stable with very high numbers. For some rough benchmarks:

len(values) may be up to a million
x may be up to 500 000
k may be up to x

Preferably linear time complexity

If relevant, for my purposes all items in "values" are positive floats ranging from 0.000000 to 400.000000

I have working code, however it breaks down at higher numbers. Specifically, when k is a high percentage of x. I also have working code for the case where k=1 that is quite effective. I am having trouble making it robust to run as components get infinitely small and brick the program. Surely I am missing a very simple solution.

values = np.sort(merged_df["INPUT_NUMBERS"].values)  # Sorted for ECDF
n = len(values)

def empirical_ecdf(values):
    # empirical cumulative distribution function (ECDF)
    return np.arange(1, n + 1) / n

import numpy as np
import scipy.special as sp

def kmax(x, k):
    n = len(values)
    ecdf = empirical_ecdf(values)
    
    pmf_k = np.zeros(n)
    
    for i in range(n):
        if i == n - 1:
            pmf_k[i] = sp.comb(x, k) * ((1 - ecdf[i-1]) ** k) * (ecdf[i-1] ** (x - k))
        else:
            pmf_k[i] = sp.comb(x, k) * ((1 - ecdf[i]) ** k - (1 - ecdf[i+1]) ** k) * (ecdf[i] ** (x - k))
    
    pmf_k = pmf_k / np.sum(pmf_k)
    
    mean_kth = np.sum(values * pmf_k)
    mean_kth_sq = np.sum(values**2 * pmf_k)
    
    std_kth = np.sqrt(mean_kth_sq - mean_kth**2)
    
    return mean_kth, std_kth

Mean and Standard Devation of Kth Highest Number Given Infinite Samples of Size x

Answers (0)

Related Questions