Ashwanth Madhav
Ashwanth Madhav

Reputation: 1134

Grouping nearest locations in Mongodb

Location point saved as

{
  "location_point" : {
  "coordinates" : [ 
      -95.712891, 
      37.09024
  ],
  "type" : "Point"
  },
  "location_point" : {
  "coordinates" : [ 
      -95.712893, 
      37.09024
  ],
  "type" : "Point"
  },
  "location_point" : {
  "coordinates" : [ 
      -85.712883, 
      37.09024
  ],
  "type" : "Point"
  },
  .......
  .......
}

There are several documents. I need to group it by nearest locations. After grouping first econd locations will be in one document and third one in the second. Please not that location point of first and second is not equal. Both are nearest places.

Is there any way? Thanks in advance.

Upvotes: 2

Views: 2710

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151220

The quick and lazy explanation is to use both $geoNear and $bucket aggregation pipeline stages to get the result:

.aggregate([
    {
      "$geoNear": {
        "near": {
          "type": "Point",
          "coordinates": [
            -95.712891,
            37.09024
          ]
        },
        "spherical": true,
        "distanceField": "distance",
        "distanceMultiplier": 0.001
      }
    },
    {
      "$bucket": {
        "groupBy": "$distance",
        "boundaries": [
          0, 5, 10, 20,  50,  100,  500
        ],
        "default": "greater than 500km",
        "output": {
          "count": {
            "$sum": 1
          },
          "docs": {
            "$push": "$$ROOT"
          }
        }
      }
    }
])

The longer form is that you should probably understand the "why?" part of how this solves the problem, and optionally even understand that even though this does apply at least one aggregation operator only introduced in recent MongoDB versions, this actually has all been possible right back to MongoDB 2.4.

Using $geoNear

The main thing to look for in any "grouping" is basically going to be a "distance" field added to a result of a "near" query indicating how far that result is from the coordinates used in the search. Fortunately this is exactly what the $geoNear aggregation pipeline stage does.

The basic stage would be something like this:

{
  "$geoNear": {
    "near": {
      "type": "Point",
      "coordinates": [
        -95.712891,
        37.09024
      ]
    },
    "spherical": true,
    "distanceField": "distance",
    "distanceMultiplier": 0.001
  }
},

This stage has three mandatory arguments that must be supplied:

  • near - Is the location to use for the query. This can either be in legacy coordinate pair form or as GeoJSON data. Anything as GeoJSON is basically considered in meters for results, since that is the GeoJSON standard.

  • spherical - Mandatory, but really only when the index type is 2dsphere. It defaults to false, but you probably do want a 2dsphere index for any real Geolocation data on the surface of the earth.

  • distanceField - This is also always required and it is the name of the field to be added to the document which will contain the distance from the queried location via near. This result will be in either radians or meters depending on the type of the data format used in the near argument. The result is also affected by the optional argument as noted next.

The optional argument is:

  • distanceMultiplier - This alters the result in the named field path to distanceField. The multiplier is applied to the returned value and can be used for "conversion" of units to the desired format.

    NOTE: The distanceMultiplier does NOT apply to other optional arguments like maxDistance or minDistance. Constraints applied to these optional arguments must be in the original returned units format. Therefore with GeoJSON any bounds set for "min" or "max" distances need to be calculated as meters regardless of whether you converted a distanceMultiplier value with something like km or miles.

The main thing this is going to do is simply return the "nearest" documents ( up to 100 by default ) in order of nearest to furthest away and include the field named as the distanceField within the existing document contents and that's what was mentioned earlier as the actual output which will allow you to "group".

The distanceMultiplier here is simply converting the default meters of GeoJSON to kilometers for output. If you wanted miles in the output then you would change the multiplier. i.e:

"distanceMultiplier": 0.000621371

It's totally optional, but you will need to be aware of what units ( converted or not ) are to be applied in the next "grouping" stage:


The actual "grouping" comes down to three different options depending on your available MongoDB and you actual needs:

Option 1 - $bucket

The $bucket pipeline stage was added with MongoDB 3.4. It's actually one of a few "pipeline stages" which were added in that version which are really more like a macro function or a basic form of shorthand for writing a combination of pipeline stages and actual operators. More on that later.

The main basic arguments are the groupBy expression, the boundaries which specifies the lower bounds for "grouping" ranges, and a default option which is basically applied as the *"grouping key" or _id field in output whenever data matching the groupBy expression does not fall between the entries defined with the boundaries.

    {
      "$bucket": {
        "groupBy": "$distance",
        "boundaries": [
          0, 5, 10, 20,  50,  100,  500
        ],
        "default": "greater than 500km",
        "output": {
          "count": {
            "$sum": 1
          },
          "docs": {
            "$push": "$$ROOT"
          }
        }
      }
    }

The other section is the output, which basically contains the same accumulator expressions that you would use with a $group, and that really should give you an indication of which aggregation pipeline stage this $bucket actually expands to. Those do the actual "data collection" per "grouping key".

Whilst useful, there is one small fault with $bucket in that the _id output will only ever be the values defined within the boundaries or within the default option where data falls outside of the boundaries constraint. If you wanted something "nicer", it would be typically be done in client post processing of the results, with something like:

result = result
  .map(({ _id, ...e }) =>
    ({
      _id: (!isNaN(parseFloat(_id)) && isFinite(_id))
        ? `less than ${bounds[bounds.indexOf(_id)+1]}km`
        : _id,
      ...e
    })
  );

That would replace any plain numeric values within the returned _id fields with a more meaningful "string" describing what is actually being grouped.

Note that whilst a default is "optional", you will receive a hard error in the case where any data falls outside of the boundary range. In fact the very specific error returned leads us to the next case.

Option 2 - $group and $switch

From what was said above you might have realized that the "macro translation" from the $bucket pipeline stage actually becomes a $group stage, and one that specifically applies the $switch operator as it's argument to the _id field for grouping. Again the $switch operator was introduced with MongoDB 3.4.

Essentially this is really a manual construction of what was shown above using $bucket, with a little fine tuning on the output of the _id fields and being a little less terse with the expressions that are produced by the former. In fact you can use the "explain" output of an aggregation pipeline to see something "similar" to the following listing, but using the defined pipeline stage above:

{
  "$group": {
    "_id": {
      "$switch": {
        "branches": [
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    5
                  ]
                },
                {
                  "$gte": [
                    "$distance",
                    0
                  ]
                }
              ]
            },
            "then": "less than 5km"
          },
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    10
                  ]
                }
              ]
            },
            "then": "less than 10km"
          },
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    20
                  ]
                }
              ]
            },
            "then": "less than 20km"
          },
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    50
                  ]
                }
              ]
            },
            "then": "less than 50km"
          },
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    100
                  ]
                }
              ]
            },
            "then": "less than 100km"
          },
          {
            "case": {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    500
                  ]
                }
              ]
            },
            "then": "less than 500km"
          }
        ],
        "default": "greater than 500km"
      }
    },
    "count": {
      "$sum": 1
    },
    "docs": {
      "$push": "$$ROOT"
    }
  }
}

In fact aside from the clearer "labelling" the only actual difference is the $bucket uses the $gte expression along with an $lte on every single case. This is not necessary due to how $switch actually works and how the logical conditions "fall through" just like they would in the common language counterpart usage of a switch logic block.

This is really more about a matter of personal preference to whether you are happier defining the output "strings" for the _id within the case statements or if you are okay with post processing values in order to reformat things like that.

Either way, these basically return the same output ( except there is a defined order to $bucket results ) as does our third option.

Option 3 - $group and $cond

As noted all the above is essentially based around the $switch operator, but just like it's counterpart in various programming language implementations a "switch statement" is really just a cleaner and more convenient way of writing if .. then .. else if ... and so on. MongoDB also has a if .. then .. else expression right back to MongoDB 2.2 with $cond:

{
  "$group": {
    "_id": {
      "$cond": [
        {
          "$and": [
            {
              "$lt": [
                "$distance",
                5
              ]
            },
            {
              "$gte": [
                "$distance",
                0
              ]
            }
          ]
        },
        "less then 5km",
        {
          "$cond": [
            {
              "$and": [
                {
                  "$lt": [
                    "$distance",
                    10
                  ]
                }
              ]
            },
            "less then 10km",
            {
              "$cond": [
                {
                  "$and": [
                    {
                      "$lt": [
                        "$distance",
                        20
                      ]
                    }
                  ]
                },
                "less then 20km",
                {
                  "$cond": [
                    {
                      "$and": [
                        {
                          "$lt": [
                            "$distance",
                            50
                          ]
                        }
                      ]
                    },
                    "less then 50km",
                    {
                      "$cond": [
                        {
                          "$and": [
                            {
                              "$lt": [
                                "$distance",
                                100
                              ]
                            }
                          ]
                        },
                        "less then 100km",
                        "greater than 500km"
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    },
    "count": {
      "$sum": 1
    },
    "docs": {
      "$push": {
        "_id": "$_id",
        "location_point": "$location_point",
        "distance": "$distance"
      }
    }
  }
}

Again it's really all just the same, with the main difference being that instead of a "clean array" of options to process as "cases", what you have instead is a nested set of conditions where the else just contains another $cond, right up until the end of the "boundaries" are found and then the else contains just the default value.

Since we are also at least "pretending" that we are going back as far as MongoDB 2.4 ( which is the constraint for actually running with $geoNear, then other things like $$ROOT would not be available in that version so instead you would simply name all the field expressions of the document in order to add that content with a $push.


Code Generation

All of this really should come down to that the "grouping" is actually done with the $bucket and that it's probably what you would use unless you wanted some customization of the output or if your MongoDB version did not support it ( though you probably should not be running any MongoDB under 3.4 at this present time of writing ).

Of course any other form is longer in the required syntax, but really just the same array of arguments can be applied to essentially generate and run either of the shown forms above.

An example listing ( for NodeJS ) follows which demonstrates that it's really just a simple process to generate everything here from just a simple array of bounds for the grouping, and even just a few defined options which can both be re-used within pipeline operations as well as any client pre or post processing for either generation of the pipeline instructions, or for manipulation of the returned results into a "prettier" output format.

const { Schema } = mongoose = require('mongoose');

const uri = 'mongodb://localhost:27017/test',
      options = { useNewUrlParser: true };

mongoose.set('useFindAndModify', false);
mongoose.set('useCreateIndex', true);
mongoose.set('debug', true);

const geoSchema = new Schema({
  location_point: {
    type: { type: String, enum: ["Point"], default: "Point" },
    coordinates: [Number, Number]
  }
});

geoSchema.index({ "location_point": "2dsphere" },{ background: false });

const GeoModel = mongoose.model('GeoModel', geoSchema, 'geojunk');

const [{ location_point: near }] = data = [
  [ -95.712891, 37.09024 ],
  [ -95.712893, 37.09024 ],
  [ -85.712883, 37.09024 ]
].map(coordinates => ({ location_point: { type: 'Point', coordinates } }));


const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const conn = await mongoose.connect(uri, options);

    // Clean data
    await Promise.all(
      Object.entries(conn.models).map(([k,m]) => m.deleteMany())
    );

    // Insert data
    await GeoModel.insertMany(data);

    const bounds = [ 5, 10, 20, 50, 100, 500 ];
    const distanceField = "distance";


    // Run three sample cases
    for ( let test of [0,1,2] ) {

      let pipeline = [
        { "$geoNear": {
          near,
          "spherical": true,
          distanceField,
          "distanceMultiplier": 0.001
        }},
        (() => {

          // Standard accumulators
          const output = {
            "count":  { "$sum": 1 },
            "docs": { "$push": "$$ROOT" }
          };

          switch (test) {

            case 0:
              log("Using $bucket");
              return (
                { "$bucket": {
                  "groupBy": `$${distanceField}`,
                  "boundaries": [ 0, ...bounds ],
                  "default": `greater than ${[...bounds].pop()}km`,
                  output
                }}
              );
            case  1:
              log("Manually using $switch");
              let branches = bounds.map((bound,i) =>
                ({
                  'case': {
                    '$and': [
                      { '$lt': [ `$${distanceField}`, bound ] },
                      ...((i === 0) ? [{ '$gte': [ `$${distanceField}`, 0 ] }]: [])
                    ]
                  },
                  'then': `less than ${bound}km`
                })
              );
              return (
                { "$group": {
                  "_id": {
                    "$switch": {
                      branches,
                      "default": `greater than ${[...bounds].pop()}km`
                    }
                  },
                  ...output
                }}
              );
            case 2:
              log("Legacy using $cond");
              let _id = null;

              for (let i = bounds.length -1; i > 0; i--) {
                let rec = {
                  '$cond': [
                    { '$and': [
                      { '$lt': [ `$${distanceField}`, bounds[i-1] ] },
                      ...((i == 1) ? [{ '$gte': [ `$${distanceField}`, 0 ] }] : [])
                    ]},
                    `less then ${bounds[i-1]}km`
                  ]
                };

                if ( _id == null ) {
                  rec['$cond'].push(`greater than ${bounds[i]}km`);
                } else {
                  rec['$cond'].push( _id );
                }
                _id = rec;
              }

              // Older MongoDB may require each field instead of $$ROOT
              output.docs.$push =
                ["_id", "location_point", distanceField]
                  .reduce((o,e) => ({ ...o, [e]: `$${e}` }),{});
              return ({ "$group": { _id, ...output } });

          }

        })()
      ];

      let result = await GeoModel.aggregate(pipeline);


      // Text based _id for test: 0 with $bucket
      if ( test === 0 )
        result = result
          .map(({ _id, ...e }) =>
            ({
              _id: (!isNaN(parseFloat(_id)) && isFinite(_id))
                ? `less than ${bounds[bounds.indexOf(_id)+1]}km`
                : _id,
              ...e
            })
          );

      log({ pipeline, result });

    }

  } catch (e) {
    console.error(e)
  } finally {
    mongoose.disconnect();
  }

})()

And sample output ( and of course ALL listings above are generated from this code ):

Mongoose: geojunk.createIndex({ location_point: '2dsphere' }, { background: false })
"Using $bucket"
{
  "result": [
    {
      "_id": "less than 5km",
      "count": 2,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe94",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712891,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 0
        },
        {
          "_id": "5ca897dd2efdc41b79d5fe95",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712893,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 0.00017759511720976155
        }
      ]
    },
    {
      "_id": "greater than 500km",
      "count": 1,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe96",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -85.712883,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 887.5656539981669
        }
      ]
    }
  ]
}
"Manually using $switch"
{
  "result": [
    {
      "_id": "greater than 500km",
      "count": 1,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe96",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -85.712883,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 887.5656539981669
        }
      ]
    },
    {
      "_id": "less than 5km",
      "count": 2,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe94",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712891,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 0
        },
        {
          "_id": "5ca897dd2efdc41b79d5fe95",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712893,
              37.09024
            ]
          },
          "__v": 0,
          "distance": 0.00017759511720976155
        }
      ]
    }
  ]
}
"Legacy using $cond"
{
  "result": [
    {
      "_id": "greater than 500km",
      "count": 1,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe96",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -85.712883,
              37.09024
            ]
          },
          "distance": 887.5656539981669
        }
      ]
    },
    {
      "_id": "less then 5km",
      "count": 2,
      "docs": [
        {
          "_id": "5ca897dd2efdc41b79d5fe94",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712891,
              37.09024
            ]
          },
          "distance": 0
        },
        {
          "_id": "5ca897dd2efdc41b79d5fe95",
          "location_point": {
            "type": "Point",
            "coordinates": [
              -95.712893,
              37.09024
            ]
          },
          "distance": 0.00017759511720976155
        }
      ]
    }
  ]
}

Upvotes: 8

Related Questions