user3080286
user3080286

Reputation: 311

Calculate the median in MongoDB aggregation framework

Is there a way to calculate the median using the MongoDB aggregation framework?

Upvotes: 31

Views: 20329

Answers (7)

nimrod serok
nimrod serok

Reputation: 16033

Since mongoDB version 7.0 there is a $median accumulator. For example:

db.collection.aggregate([
  {$group: {
      _id: null,
      median: {
        $median: {
          input: "$rating",
          method: "approximate"
        }
      }
    }
  }
])

See how it works on the playground since version 7.0

Upvotes: 1

JTan
JTan

Reputation: 209

My solution to this problem is quite similar to Taher's answer but with fewer $project stages.

// { "value" : 1 }
// { "value" : 2  }
// { "value" : 4 }
// { "value" : 5 }
db.median_values.aggregate([
  // Sort the values
  { $sort: { value: 1 } },
  // Get an array of all the values
  { $group: { _id: null, valuesArray: { $push: "$value" } } },
  // Get if the array has an even or odd number of elements
  {
    $project: {
      _id: 0,
      valuesArray: 1,
      isEven: { $eq: [{ $mod: [{ $size: "$valuesArray" }, 2] }, 0] },
      dividedByTwoIndex: { $divide: [{ $size: "$valuesArray" }, 2] },
    },
  },
  // Get the left value and right value if the array has an even or odd number of elements
  {
    $project: {
      _id: 0,
      left: {
        $cond: {
          if: "$isEven",
          then: {
            $arrayElemAt: [
              "$valuesArray",
              { $subtract: ["$dividedByTwoIndex", 1] },
            ],
          },
          else: {
            $arrayElemAt: ["$valuesArray", { $floor: "$dividedByTwoIndex" }],
          },
        },
      },
      right: {
        $cond: {
          if: "$isEven",
          then: {
            $arrayElemAt: ["$valuesArray", "$dividedByTwoIndex"],
          },
          else: {
            $arrayElemAt: ["$valuesArray", { $floor: "$dividedByTwoIndex" }],
          },
        },
      },
    },
  },

  // Compute the median value
  { $project: { median: { $avg: ["$left", "$right"] } } },
]);

// Output:
// { "median" : 3 }

Upvotes: 0

Taher
Taher

Reputation: 752

While maxiplay's answer is not accurate, it did lead me in the correct direction. The problem with the given solution is that it only works when the number of records is even. Because for odd number of records, one simply needs to take the value at midpoint without having to calculate an average.

This is how I got it to work.

db.collection.aggregate([
{ "$match": { "processingStatus": "Completed" } },
{ "$sort": { "value": 1 } },
{ 
    "$group": {
        "_id": "$userId",
        "valueArray": {
            "$push": "$value"
        }
    } 
},
{
    "$project": {
        "_id": 0,
        "userId": "$_id",
        "valueArray": 1,
        "size": { "$size": ["$valueArray"] }
    }
},
{
    "$project": {
        "userId": 1,
        "valueArray": 1,
        "isEvenLength": { "$eq": [{ "$mod": ["$size", 2] }, 0 ] },
        "middlePoint": { "$trunc": { "$divide": ["$size", 2] } }
    }
},
{
    "$project": {
        "userId": 1,
        "valueArray": 1,
        "isEvenLength": 1,
        "middlePoint": 1,
        "beginMiddle": { "$subtract": [ "$middlePoint", 1] },
        "endMiddle": "$middlePoint"
    }
},
{
    "$project": {
        "userId": 1,
        "valueArray": 1,
        "middlePoint": 1,
        "beginMiddle": 1,
        "beginValue": { "$arrayElemAt": ["$valueArray", "$beginMiddle"] },
        "endValue": { "$arrayElemAt": ["$valueArray", "$endMiddle"] },
        "isEvenLength": 1
    }
},
{
    "$project": {
        "userId": 1,
        "valueArray": 1,
        "middlePoint": 1,
        "beginMiddle": 1,
        "beginValue": 1,
        "endValue": 1,
        "middleSum": { "$add": ["$beginValue", "$endValue"] },
        "isEvenLength": 1
    }
},
{
    "$project": {
        "userId": 1,
        "valueArray": 1,
        "median": { 
            "$cond": { 
                if: "$isEvenLength", 
                then: { "$divide": ["$middleSum", 2] },
                else:  { "$arrayElemAt": ["$valueArray", "$middlePoint"] }
            } 
        }
    }
}
])

Upvotes: 5

Xavier Guihot
Xavier Guihot

Reputation: 61716

Starting Mongo 4.4, the $group stage has a new aggregation operator $accumulator allowing custom accumulations of documents as they get grouped, via javascript user defined functions.

Thus, in order to find the median:

// { "a" : 25, "b" : 12 }
// { "a" : 89, "b" : 7  }
// { "a" : 25, "b" : 17 }
// { "a" : 25, "b" : 24 }
// { "a" : 89, "b" : 15 }
db.collection.aggregate([
  { $group: {
    _id: "$a",
    median: {
      $accumulator: {
        accumulateArgs: ["$b"],
        init: function() { return []; },
        accumulate: function(bs, b) { return bs.concat(b); },
        merge: function(bs1, bs2) { return bs1.concat(bs2); },
        finalize: function(bs) {
          bs.sort(function(a, b) { return a - b });
          var mid = bs.length / 2;
          return mid % 1 ? bs[mid - 0.5] : (bs[mid - 1] + bs[mid]) / 2;
        },
        lang: "js"
      }
    }
  }}
])
// { "_id" : 25, "median" : 17 }
// { "_id" : 89, "median" : 11 }

The accumulator:

  • accumulates on the field b (accumulateArgs)
  • is initialised to an empty array (init)
  • accumulates b items in an array (accumulate and merge)
  • and finally performs the median calculation on b items (finalize)

Upvotes: 8

maxiplay
maxiplay

Reputation: 427

It's possible to do it in one shot with the aggregate framework.

Sort => put in Array sorted values => get Size of array => divide size by two => get Int value of the division (left side of median) => add 1 to left side ( right side) => get array element at left side and right side => average of the two elements

This is a sample with Spring java mongoTemplate :

The model is a list of book with the login of the author ("owner"), the objective is to get the median of book by users :

        GroupOperation countByBookOwner = group("owner").count().as("nbBooks");

    SortOperation sortByCount = sort(Direction.ASC, "nbBooks");

    GroupOperation putInArray = group().push("nbBooks").as("nbBooksArray");

    ProjectionOperation getSizeOfArray = project("nbBooksArray").and("nbBooksArray").size().as("size");

    ProjectionOperation divideSizeByTwo = project("nbBooksArray").and("size").divide(2).as("middleFloat");

    ProjectionOperation getIntValueOfDivisionForBornLeft = project("middleFloat", "nbBooksArray").and("middleFloat")
            .project("trunc").as("beginMiddle");

    ProjectionOperation add1ToBornLeftToGetBornRight = project("beginMiddle", "middleFloat", "nbBooksArray")
            .and("beginMiddle").project("add", 1).as("endMiddle");

    ProjectionOperation arrayElementAt = project("beginMiddle", "endMiddle", "middleFloat", "nbBooksArray")
            .and("nbBooksArray").project("arrayElemAt", "$beginMiddle").as("beginValue").and("nbBooksArray")
            .project("arrayElemAt", "$endMiddle").as("endValue");

    ProjectionOperation averageForMedian = project("beginMiddle", "endMiddle", "middleFloat", "nbBooksArray",
            "beginValue", "endValue").and("beginValue").project("avg", "$endValue").as("median");

    Aggregation aggregation = newAggregation(countByBookOwner, sortByCount, putInArray, getSizeOfArray,
            divideSizeByTwo, getIntValueOfDivisionForBornLeft, add1ToBornLeftToGetBornRight, arrayElementAt,
            averageForMedian);

    long time = System.currentTimeMillis();
    AggregationResults<MedianContainer> groupResults = mongoTemplate.aggregate(aggregation, "book",
            MedianContainer.class);

And here the resulting aggregation :

{
"aggregate": "book" ,
"pipeline": [
    {
        "$group": {
            "_id": "$owner" ,
            "nbBooks": {
                "$sum": 1
            }
        }
    } , {
        "$sort": {
            "nbBooks": 1
        }
    } , {
        "$group": {
            "_id": null  ,
            "nbBooksArray": {
                "$push": "$nbBooks"
            }
        }
    } , {
        "$project": {
            "nbBooksArray": 1 ,
            "size": {
                "$size": ["$nbBooksArray"]
            }
        }
    } , {
        "$project": {
            "nbBooksArray": 1 ,
            "middleFloat": {
                "$divide": ["$size" , 2]
            }
        }
    } , {
        "$project": {
            "middleFloat": 1 ,
            "nbBooksArray": 1 ,
            "beginMiddle": {
                "$trunc": ["$middleFloat"]
            }
        }
    } , {
        "$project": {
            "beginMiddle": 1 ,
            "middleFloat": 1 ,
            "nbBooksArray": 1 ,
            "endMiddle": {
                "$add": ["$beginMiddle" , 1]
            }
        }
    } , {
        "$project": {
            "beginMiddle": 1 ,
            "endMiddle": 1 ,
            "middleFloat": 1 ,
            "nbBooksArray": 1 ,
            "beginValue": {
                "$arrayElemAt": ["$nbBooksArray" , "$beginMiddle"]
            } ,
            "endValue": {
                "$arrayElemAt": ["$nbBooksArray" , "$endMiddle"]
            }
        }
    } , {
        "$project": {
            "beginMiddle": 1 ,
            "endMiddle": 1 ,
            "middleFloat": 1 ,
            "nbBooksArray": 1 ,
            "beginValue": 1 ,
            "endValue": 1 ,
            "median": {
                "$avg": ["$beginValue" , "$endValue"]
            }
        }
    }
]

}

Upvotes: 12

drmirror
drmirror

Reputation: 3760

The median is somewhat tricky to compute in the general case, because it involves sorting the whole data set, or using a recursion with a depth that is also proportional to the data set size. That's maybe the reason why many databases don't have a median operator out of the box (MySQL doesn't have one, either).

The simplest way to compute the median would be with these two statements (assuming the attribute on which we want to compute the median is called a and we want it over all documents in the collection, coll):

count = db.coll.count();
db.coll.find().sort( {"a":1} ).skip(count / 2 - 1).limit(1);

This is the equivalent to what people suggest for MySQL.

Upvotes: 42

Philipp
Philipp

Reputation: 69703

The aggregation framework doesn't support median out-of-the-box. So you will have to write something on your own.

I would recommend you to do this on the application level. Retrieve all your documents with a normal find(), sort the result sets (either on the datbase by using the .sort() function of the cursor or sorting them in the application - your decision) and then getting the element size / 2.

When you really want to do it on the database level, you could do it with map-reduce. The map-function would emit key and an array with a single value - the value you want to get the median of. The reduce-function would just concatenate the arrays of the results it receives, so each key ends up with an array with all values. The finalize-function would then compute the median of that array, again by by sorting the array and then get the element number size / 2.

Upvotes: 2

Related Questions