chengcj

Reputation: 928

C++ Fast Percentile Calculation

I'm trying to write a percentile function that takes two vectors as input and returns one vector as output. One of the input vectors (Distr) is a distribution of random numbers. The other input vector (Tests) holds the values whose percentiles I want to calculate against Distr. The output is a vector (the same size as Tests) containing the percentile of each value in Tests.

The following is an example of what I want:

Input Distr = {3, 5, 8, 12}
Input Tests = {4, 9}
Output Percentile = {0.375, 0.8125}
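
For reference, these values come from linear interpolation between the neighbouring sorted values of Distr: for 4, (4 - 3)/(5 - 3) = 0.5 and (0.5 + 1)/4 = 0.375; for 9, (9 - 8)/(12 - 8) = 0.25 and (0.25 + 3)/4 = 0.8125.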

The following is my implementation in C++:

#include <algorithm> // std::sort
#include <vector>

using std::vector;

vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();

    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {

        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            prevValue = Distr[0];
            for (unsigned sortedDistrIdx = 1; sortedDistrIdx < distrSize; sortedDistrIdx++)
            {
                nextValue = Distr[sortedDistrIdx];

                if (nextValue <= *test)
                {
                    prevValue = nextValue;
                }
                else
                {
                    // linear interpolation
                    result.push_back(((*test - prevValue) / (nextValue - prevValue) + sortedDistrIdx) / distrSize);
                    break;
                }
            }
        }
    }
    return result;
}

The size of both Distr and Tests can be from 2,000 to 30,000.

Are there any existing libraries that can calculate percentile as shown above (or similar)? If not, how can I make the above code faster?

Upvotes: 4

Views: 17902

Answers (4)

Walter

Reputation: 45444

This answer is relevant to the case where the input is initially random (not sorted) and test.size() is smaller than input.size(), which is the most common situation.

Suppose there is only one test value. Then you only have to partition the input with respect to this value and obtain the upper (lower) bound of the lower (upper) partition to compute the respective percentile. This is much faster than a full sort of the input (which quicksort implements as a recursion of partitions).

If test.size() > 1, then you first sort test (ideally, test is already sorted and you can skip this step) and subsequently proceed with the test elements in increasing order, each time only partitioning the upper part left over from the previous partition. Since we also keep track of the lower bound of the upper partition (as well as the upper bound of the lower partition), we can detect when no input data lie between consecutive test elements and avoid partitioning in that case.

This algorithm should be near-optimal, since no unnecessary information is generated (as it would be with a full sort of input).

If each subsequent partitioning split the input roughly in half, the algorithm would be optimal. This could be approximated by proceeding not in increasing order of test, but by successive halving of test, i.e. starting with the median test element, then the first and third quartiles, and so on.
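
As a rough illustration of the single-test-value case, here is a minimal sketch using std::partition (the function name PercentileOfOne is mine, and it returns a plain rank/size percentile without the question's interpolation and boundary handling):

#include <algorithm>
#include <vector>

// Sketch only: percentile of a single test value via one partition pass
// instead of a full sort. Returns rank / size; interpolation is omitted.
double PercentileOfOne(std::vector<double> distr, double test)
{
    // std::partition moves every element < test to the front and returns
    // an iterator to the first element of the second group (values >= test).
    auto mid = std::partition(distr.begin(), distr.end(),
                              [test](double x) { return x < test; });
    return static_cast<double>(mid - distr.begin()) / distr.size();
}

Processing sorted test values in increasing order then only needs to partition the part of the data to the right of the previous partition point, as described above.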

Upvotes: 0

JSF

Reputation: 5321

The linear search of Distr for each element of Tests would account for the bulk of the time if both of those are large.

When Distr is much larger, it is much faster to do a binary search instead of a linear one. There is a binary search algorithm available in std (e.g. std::lower_bound), so you don't need to write one.

When Tests is nearly as big as Distr, or bigger, it is faster to do an index sort of Tests and then sequence through the two sorted lists together, storing the results, and then output the stored results in a second pass.
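
As a hedged sketch of the "index sort of Tests" step mentioned above (the helper name SortedIndices is mine, not from any library):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch only: sort the indices of Tests by the values they refer to, so
// percentiles computed in sorted order can later be written back to the
// positions of the original, unsorted Tests.
std::vector<std::size_t> SortedIndices(const std::vector<double>& Tests)
{
    std::vector<std::size_t> idx(Tests.size());
    std::iota(idx.begin(), idx.end(), 0);          // fill with 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&Tests](std::size_t a, std::size_t b) { return Tests[a] < Tests[b]; });
    return idx;
}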

Edit: I see the answer by Csaba Balint gives a little more detail on what I meant by "sequence through the two sorted lists together".

Edit: The basic methods being discussed are:
1) Sort both lists and then process them linearly together: time N log N + M log M.
2) Sort just one list (say the one of size M) and binary search the other against it: time (N + M) log M.
3) Sort just the other list and partition: a time I haven't figured out, but when N and M are similar it has to be larger than either method 1 or 2, and when N is sufficiently tiny it has to be smaller than methods 1 or 2.

Upvotes: 0

Monfico

Reputation: 154

I would do something like this:

vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();

    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {
        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            auto it = std::lower_bound(Distr.begin(), Distr.end(), *test);
            prevValue = *(it - 1);
            nextValue = *it; // first element not less than *test
            // linear interpolation
            result.push_back(((*test - prevValue) / (nextValue - prevValue) + (it - Distr.begin())) / distrSize);
        }
    }
    return result;
}

Note that instead of doing a linear search on Distr for each test, I exploit the fact that Distr is sorted and do a binary search instead (using std::lower_bound).
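
For example, with the data from the question (and assuming the usual headers plus the Percentile function above), the following sketch should print 0.375 0.8125:

#include <iostream>
#include <vector>

int main()
{
    std::vector<double> distr = {3, 5, 8, 12};
    std::vector<double> tests = {4, 9};
    for (double p : Percentile(distr, tests))
        std::cout << p << ' ';
    std::cout << '\n';
    return 0;
}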

Upvotes: 1

Csaba Bálint

Reputation: 86

There is an algorithm for your problem that is linear after sorting (so linear times logarithmic in both sizes overall). You need to sort both vectors and then have two iterators going through them (itDistr, itTest). There are three possibilities:

1. *itDistr < *itTest

Here, you have nothing to do except increment itDistr.

2. *itDistr >= *itTest

This is the case where you have found a test value for which *itTest is an element of the interval [ *(itDistr-1), *itDistr ). So you have to do the interpolation you have been using (linear), and then increment itTest.

The third possibility is when either of them reaches the end of its container vector. You also have to define what happens at the beginning and at the end; that depends on how you define the distribution from the series of your numbers.
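
A minimal sketch of this two-iterator walk, under the assumption that the boundaries should mirror the question's choices (1/n below the smallest value, 1 above the largest); the name PercentileMerge is mine:

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<double> PercentileMerge(std::vector<double> Distr, std::vector<double> Tests)
{
    std::sort(Distr.begin(), Distr.end());
    std::sort(Tests.begin(), Tests.end());

    std::vector<double> result;
    result.reserve(Tests.size());

    const double n = static_cast<double>(Distr.size());
    std::size_t i = 0;                       // plays the role of itDistr

    for (double t : Tests)                   // plays the role of itTest
    {
        // Case 1: *itDistr < *itTest, so just advance itDistr.
        while (i < Distr.size() && Distr[i] < t)
            ++i;

        if (i == 0)
            result.push_back(1.0 / n);       // below the smallest value (assumption)
        else if (i == Distr.size())
            result.push_back(1.0);           // above the largest value (assumption)
        else
            // Case 2: t lies in [Distr[i-1], Distr[i]), linear interpolation.
            result.push_back(((t - Distr[i - 1]) / (Distr[i] - Distr[i - 1]) + i) / n);
    }
    return result;                           // in the order of the sorted Tests
}

Note that the results come out in the order of the sorted Tests; to return them in the original order, combine this with an index sort of Tests as mentioned in JSF's answer.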

Are there any existing libraries that can calculate percentile as shown above (or similar)?

Probably, but it is easy to implement yourself, and that way you have fine control over the interpolation technique.

Upvotes: 0
