Taris

Reputation: 33

Median of medians - is this possible or is there a different way

Currently I am aggregating a large amount of data on a daily basis, and for each day I calculate the median of that day's values. Now I need to aggregate all these daily results on a monthly basis, and of course I need to calculate the median again.

Is there a way to calculate a median of medians and have it be statistically correct? I want to avoid using the raw data again, because there is a huge amount of it :)

As a small proof of concept I wrote this JavaScript; maybe it helps to find a way:

var aSortedNumberGroups = [];
var aSortedNumbers = [];
var aMedians = [];

// Returns the median of an array that is already sorted in ascending order.
Math.median = function(aData)
{
    var fMedian = 0;
    var iIndex = Math.floor(aData.length/2);
    if (!(aData.length%2)) {
        fMedian = (aData[iIndex-1]+aData[iIndex])/2;
    } else {
        fMedian = aData[iIndex];
    }

    return fMedian;
};

for (var iCurrGroupNum = 0; iCurrGroupNum < 5; ++iCurrGroupNum) {
    var aCurrNums = [];
    for (var iCurrNum = 0; iCurrNum < 1000; ++iCurrNum) {
        var iCurrRandomNumber = Math.floor(Math.random()*10001);
        aCurrNums.push(iCurrRandomNumber);
        aSortedNumbers.push(iCurrRandomNumber);
    }
    // Numeric ascending sort (the default sort would compare values as strings).
    aCurrNums.sort(function(iNumA, iNumB) {
        return iNumA - iNumB;
    });
    aSortedNumberGroups.push(aCurrNums);
    aMedians.push(Math.median(aCurrNums));
}

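console.log("Medians of each group: "+JSON.stringify(aMedians, null, 4));

// Math.median expects sorted input, so sort both arrays numerically first.
var fnAscending = function(iNumA, iNumB) { return iNumA - iNumB; };
aMedians.sort(fnAscending);
aSortedNumbers.sort(fnAscending);

console.log("Median of medians: "+Math.median(aMedians));
console.log("Median of all: "+Math.median(aSortedNumbers));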

As you will see, there is often a huge gap between the median of all the raw numbers and the median of medians, and I would like them to be pretty close to each other.

Thanks a lot!

Upvotes: 3

Views: 8767

Answers (4)

user2939459

Reputation: 11

I know this is a very dated thread, but future readers may find Tukey's Ninther method quite relevant ... analysis here: http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/
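For readers who want to see the idea in code, here is a minimal sketch of a ninther: split nine sampled values into three groups of three, take the median of each group, then take the median of those three medians. The function names `median3` and `ninther` are made up for this illustration, and how the nine values are sampled from the data is left open. The result is only an estimate of the median, not the exact value.

```javascript
// Median of three numbers without sorting.
function median3(a, b, c) {
    return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
}

// Ninther: median of the medians of three groups of three.
// Expects an array of exactly nine sampled values (hypothetical helper).
function ninther(aNine) {
    if (aNine.length !== 9) {
        throw new Error("ninther expects exactly nine values");
    }
    return median3(
        median3(aNine[0], aNine[1], aNine[2]),
        median3(aNine[3], aNine[4], aNine[5]),
        median3(aNine[6], aNine[7], aNine[8])
    );
}

// The exact median of these nine values is 4; the ninther returns 5.
console.log(ninther([3, 1, 4, 1, 5, 9, 2, 6, 5])); // 5
```

The example output shows the trade-off: the estimate is cheap to compute but can differ from the exact median.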

-kg

Upvotes: 1

btilly

Reputation: 46455

Yet another approach is to take each day's data, parse it, and store it in sorted order. For a given day you can just look at the median piece of data and you've got your answer.

At the end of the month you can do a quick-select to find the median. You can take advantage of the sorted order of each day's data to do a binary search to split it. The result is that your end of month processing will be very, very quick.

The same kind of data, organized in the same kind of way, will also let you do various percentiles very cheaply. The only hard part is extracting each day's raw data and sorting it.
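A rough sketch of the end-of-month step, assuming each day's values are integers already stored in ascending order. It is not a quick-select; it swaps in a simpler variant that binary-searches over the value range and uses a binary search inside each sorted day to count entries. The names `countAtMost`, `kthAcrossDays` and `medianAcrossDays` are invented for this example.

```javascript
// Number of entries <= value in an ascending-sorted array (binary search).
function countAtMost(aSorted, value) {
    var lo = 0, hi = aSorted.length;
    while (lo < hi) {
        var mid = (lo + hi) >>> 1;
        if (aSorted[mid] <= value) { lo = mid + 1; } else { hi = mid; }
    }
    return lo;
}

// k-th smallest value (0-based) across all days. Assumes integer values
// and at least one non-empty day.
function kthAcrossDays(aSortedDays, k) {
    var lo = Infinity, hi = -Infinity;
    aSortedDays.forEach(function (aDay) {
        if (aDay.length) {
            lo = Math.min(lo, aDay[0]);
            hi = Math.max(hi, aDay[aDay.length - 1]);
        }
    });
    // Find the smallest integer v with at least k + 1 entries <= v.
    while (lo < hi) {
        var mid = Math.floor((lo + hi) / 2);
        var count = 0;
        for (var i = 0; i < aSortedDays.length; ++i) {
            count += countAtMost(aSortedDays[i], mid);
        }
        if (count >= k + 1) { hi = mid; } else { lo = mid + 1; }
    }
    return lo;
}

// Exact median of all values without merging the daily arrays.
function medianAcrossDays(aSortedDays) {
    var total = aSortedDays.reduce(function (n, aDay) { return n + aDay.length; }, 0);
    if (total === 0) { return NaN; }
    var upper = kthAcrossDays(aSortedDays, Math.floor(total / 2));
    if (total % 2) { return upper; }
    return (kthAcrossDays(aSortedDays, total / 2 - 1) + upper) / 2;
}

var aDays = [[1, 1, 10], [1, 1, 10], [9, 10, 10]];
console.log(medianAcrossDays(aDays)); // 9
```

Changing the rank `k` gives other percentiles with the same counting routine.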

Upvotes: 0

quentinxs

Reputation: 866

No, unfortunately there is no way to calculate the median from the medians of subsets of the whole and still be statistically accurate. If you wanted to calculate the mean, however, you could use the means of the subsets, provided they are all of equal size.
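A small made-up example illustrates both points. The three groups below have equal size, so the mean of the group means matches the overall mean, while the median of the group medians lands far from the overall median. The helper names `mean` and `median` are just for this sketch.

```javascript
function mean(aValues) {
    return aValues.reduce(function (sum, v) { return sum + v; }, 0) / aValues.length;
}

// Sorts a copy, so the input does not need to be sorted.
function median(aValues) {
    var aSorted = aValues.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(aSorted.length / 2);
    return aSorted.length % 2 ? aSorted[mid] : (aSorted[mid - 1] + aSorted[mid]) / 2;
}

var aGroups = [[1, 1, 10], [1, 1, 10], [9, 10, 10]];
var aAll = [].concat.apply([], aGroups);

// Median of medians is 1, but the median of all nine values is 9.
console.log(median(aGroups.map(median)), median(aAll));

// Mean of means and overall mean agree (both about 5.889).
console.log(mean(aGroups.map(mean)), mean(aAll));
```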

ckozl's optimization (the running tally described in that answer) could be of assistance to you.

Upvotes: 3

ckozl

Reputation: 6761

You don't actually "calculate" a median; you "discover" it by redistributing the data into subsets. The only optimization for this is a reloadable "tick chart" or running tally: store each distinct value together with the number of times it occurred. This way you can recreate the distribution without having to reparse the raw data. It is only a small optimization, but depending on how much repetition there is in the data set, you could save yourself tons of MB and, at the very least, a bunch of processor cycles.

Think of it in JSON: { '1': 3, '5': 12, '7': 4 }, meaning '1' has occurred 3 times, '5' has occurred 12 times, etc.

Then persist those counts, starting at the beginning of the time period for which you want to get a median.
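A minimal sketch of how such a tally could be built, merged over a period, and read back as an exact median. The function names `tally`, `mergeTally` and `medianFromTally` are hypothetical, and the sketch assumes the values are plain numbers that survive being used as object keys.

```javascript
// Build a tally { value: count } from one day's raw values.
function tally(aValues) {
    var oCounts = {};
    aValues.forEach(function (v) { oCounts[v] = (oCounts[v] || 0) + 1; });
    return oCounts;
}

// Add one day's tally into a running tally (for example, the month so far).
function mergeTally(oTarget, oDayCounts) {
    Object.keys(oDayCounts).forEach(function (sKey) {
        oTarget[sKey] = (oTarget[sKey] || 0) + oDayCounts[sKey];
    });
    return oTarget;
}

// Walk the distinct values in ascending order until the middle rank(s) are reached.
function medianFromTally(oCounts) {
    var aValues = Object.keys(oCounts).map(Number).sort(function (a, b) { return a - b; });
    var total = aValues.reduce(function (n, v) { return n + oCounts[v]; }, 0);
    if (total === 0) { return NaN; }
    var lowerRank = Math.floor((total - 1) / 2); // 0-based ranks of the middle element(s)
    var upperRank = Math.floor(total / 2);
    var seen = 0, lower = null;
    for (var i = 0; i < aValues.length; ++i) {
        seen += oCounts[aValues[i]];
        if (lower === null && seen > lowerRank) { lower = aValues[i]; }
        if (seen > upperRank) { return (lower + aValues[i]) / 2; }
    }
}

console.log(medianFromTally({ '1': 3, '5': 12, '7': 4 })); // 5

// Merging three daily tallies gives the exact median of all raw values.
var oMonth = {};
[[1, 1, 10], [1, 1, 10], [9, 10, 10]].forEach(function (aDay) {
    mergeTally(oMonth, tally(aDay));
});
console.log(medianFromTally(oMonth)); // 9
```

Only the per-day tallies need to be stored; the raw data is read once.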

hope this helps -ck

Upvotes: 4
