user9817468211
user9817468211

Reputation: 11

Is $push+$reduce really slower than a custom accumulator?

Let's say I have a few million documents in a MongoDB collection that look like this:

[
  {
    "timestamp": "2024-01-01T00:00:00Z",
    "schema": "1.0.0",
    "value": 3,
  },
  {
    "timestamp": "2024-01-01T01:00:00Z",
    "schema": "1.2.0",
    "value": -10,
  },
  ...
]

Now, I want to do the following, using an aggregation pipeline:

So, the desired output is something like:

[
    // January bucket
  {
    "bucket": "2024-01-01T00:00:00Z",
    "value": {
      "timestamp": "2024-01-01T01:00:00Z",
      "schema": "1.2.0",
      "absMax": -10
    }
  }
]

Obviously, the default $max accumulator does not work, as it has two problems:

So, I tackled the problem trying two different ways:

  1. In my $group stage, I use $push to push all raw documents into a $raw document, which I then go through with $reduce. I need the $raw document to have timestamp and schema always available.
  2. In my $group stage, I use a custom accumulator function (see https://www.mongodb.com/docs/manual/reference/operator/aggregation/accumulator/) that reduces each document and keeps the original timestamp and schema next to the absolute maximum in its state.

Now, I encounter the following issues:

I updated my question with MongoDB playgrounds:

And for the sake of completeness using only $min and $max, but losing timestamp and schema: https://mongoplayground.net/p/UegNExWo2np

Solution 2 is about twice as fast as solution 1 on big data sets.

Am I overlooking something?

Upvotes: 0

Views: 94

Answers (2)

user9817468211
user9817468211

Reputation: 11

Now after some time and research, I want to follow up and post our solution here. We ended up using the built-in $top accumulator (docs):

{
   $top:
      {
         sortBy: { <field1>: <sort order>, <field2>: <sort order> ... },
         output: <expression>
      }
}

We used sortBy to sort by the absolute of the field's value, maintaining the original sign in the output. For us, this gave us significant performance increase over the original ideas I posted.

Upvotes: 1

Wernfried Domscheit
Wernfried Domscheit

Reputation: 59563

You can group easier with $dateTrunc. For min and max value you can use $bottom or $first

db.collection.aggregate([
  { $set: { absValue: { $abs: "$value" } } },
  {
    $group: {
      _id: {
        $dateTrunc: {
          date: "$timestamp",
          unit: "month",
          timezone: "Europe/Zurich"
        }
      },
      maxValue: {
        $bottom: {
          sortBy: { absValue: 1 },
          output: {
            timestamp: "$timestamp",
            schema: "$schema",
            value: "$value"
          }
        }
      }
    }
  }
])

For the minValue use either $top or change the sort order to sortBy: { absValue: -1 }

If you like to use $first and $last then you need to sort the entire collection, i.e.

db.collection.aggregate([
  { $set: { absValue: { $abs: "$value" } } },
  { $sort: { absValue: 1 } },
  {
    "$group": {
      "_id": {
        $dateTrunc: {
          date: "$timestamp",
          unit: "month",
          timezone: "Europe/Zurich"
        }
      },
      value: {
        $last: {
          timestamp: "$timestamp",
          schema: "$schema",
          value: "$value"
        }
      }
    }
  }
])

But I think this will be slower.

The rest is just some cosmetic.

Upvotes: 1

Related Questions