Reputation: 1134
Location point saved as
{ "location_point" : { "type" : "Point", "coordinates" : [ -95.712891, 37.09024 ] } }
{ "location_point" : { "type" : "Point", "coordinates" : [ -95.712893, 37.09024 ] } }
{ "location_point" : { "type" : "Point", "coordinates" : [ -85.712883, 37.09024 ] } }
.......
There are several documents. I need to group
them by nearest locations.
After grouping, the first and second locations will be in one document and the third one in the second.
Please note that the location points of the first and second documents are not equal. They are just near each other.
Is there any way? Thanks in advance.
Upvotes: 2
Views: 2710
Reputation: 151220
The quick and lazy explanation is to use both $geoNear
and $bucket
aggregation pipeline stages to get the result:
.aggregate([
{
"$geoNear": {
"near": {
"type": "Point",
"coordinates": [
-95.712891,
37.09024
]
},
"spherical": true,
"distanceField": "distance",
"distanceMultiplier": 0.001
}
},
{
"$bucket": {
"groupBy": "$distance",
"boundaries": [
0, 5, 10, 20, 50, 100, 500
],
"default": "greater than 500km",
"output": {
"count": {
"$sum": 1
},
"docs": {
"$push": "$$ROOT"
}
}
}
}
])
The longer answer is that you should probably understand the "why?" part of how this solves the problem, and also that even though this applies at least one aggregation operator only introduced in recent MongoDB versions, all of this has actually been possible right back to MongoDB 2.4.
The main thing to look for in any "grouping" is basically going to be a "distance"
field added to a result of a "near" query indicating how far that result is from the coordinates used in the search. Fortunately this is exactly what the $geoNear
aggregation pipeline stage does.
The basic stage would be something like this:
{
"$geoNear": {
"near": {
"type": "Point",
"coordinates": [
-95.712891,
37.09024
]
},
"spherical": true,
"distanceField": "distance",
"distanceMultiplier": 0.001
}
},
This stage has three mandatory arguments that must be supplied:
near - The location to use for the query. This can either be in legacy coordinate pair form or as GeoJSON data. Anything given as GeoJSON is basically considered in meters for results, since that is the GeoJSON standard.
spherical - Mandatory, but really only when the index type is 2dsphere
. It defaults to false
, but you probably do want a 2dsphere
index for any real Geolocation data on the surface of the earth.
distanceField - This is also always required and it is the name of the field to be added to the document which will contain the distance from the queried location via near
. This result will be in either radians or meters depending on the type of the data format used in the near
argument. The result is also affected by the optional argument as noted next.
The optional argument is:
distanceMultiplier - This alters the value returned in the field named by distanceField
. The multiplier is applied to the returned value and can be used for "conversion" of units to the desired format.
NOTE: The distanceMultiplier
does NOT apply to other optional arguments like maxDistance
or minDistance
. Constraints applied to these optional arguments must be in the original returned units format. Therefore with GeoJSON any bounds set for "min" or "max" distances need to be calculated as meters regardless of whether you converted a distanceMultiplier
value with something like km
or miles
.
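As a quick sanity check on the units involved, here is a small sketch ( plain NodeJS, with hypothetical constant names that are not part of any MongoDB API ) of how the multiplier and the constraint units relate:

```javascript
// Hypothetical helper constants -- not part of any MongoDB API.
const METERS_PER_KM = 1000;

// distanceMultiplier converts the meters returned for GeoJSON
// queries into the unit you want written to "distanceField":
const toKm = 1 / METERS_PER_KM; // i.e. "distanceMultiplier": 0.001

// maxDistance / minDistance are NOT affected by the multiplier,
// so a "within 500km" constraint must still be expressed in meters:
const maxDistance = 500 * METERS_PER_KM;

console.log(toKm);        // 0.001
console.log(maxDistance); // 500000
```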
The main thing this is going to do is simply return the "nearest" documents ( up to 100 by default ) in order from nearest to furthest away, and include the field named by the distanceField
within the existing document contents. That added field is what was mentioned earlier as the actual output which will allow you to "group".
The distanceMultiplier
here is simply converting the default meters of GeoJSON to kilometers for output. If you wanted miles in the output then you would change the multiplier. i.e:
"distanceMultiplier": 0.000621371
It's totally optional, but you will need to be aware of what units ( converted or not ) are to be applied in the next "grouping" stage.
The actual "grouping" comes down to three different options depending on your available MongoDB version and your actual needs:
The $bucket
pipeline stage was added with MongoDB 3.4. It's actually one of a few "pipeline stages" which were added in that version which are really more like a macro function or a basic form of shorthand for writing a combination of pipeline stages and actual operators. More on that later.
The main basic arguments are the groupBy
expression, the boundaries
which specifies the lower bounds for "grouping" ranges, and a default
option which is basically applied as the "grouping key" or _id
field in output whenever data matching the groupBy
expression does not fall between the entries defined with the boundaries
.
{
"$bucket": {
"groupBy": "$distance",
"boundaries": [
0, 5, 10, 20, 50, 100, 500
],
"default": "greater than 500km",
"output": {
"count": {
"$sum": 1
},
"docs": {
"$push": "$$ROOT"
}
}
}
}
The other section is the output
, which basically contains the same accumulator expressions that you would use with a $group
, and that really should give you an indication of which aggregation pipeline stage this $bucket
actually expands to. Those do the actual "data collection" per "grouping key".
Whilst useful, there is one small fault with $bucket
in that the _id
output will only ever be the values defined within the boundaries
or within the default
option where data falls outside of the boundaries
constraint. If you wanted something "nicer", it would typically be done in client post processing of the results, with something like:
// Here "bounds" is the boundaries list without the leading 0,
// i.e. [ 5, 10, 20, 50, 100, 500 ] as used in the full listing below
result = result
.map(({ _id, ...e }) =>
({
_id: (!isNaN(parseFloat(_id)) && isFinite(_id))
? `less than ${bounds[bounds.indexOf(_id)+1]}km`
: _id,
...e
})
);
That would replace any plain numeric values within the returned _id
fields with a more meaningful "string" describing what is actually being grouped.
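To see what that mapping actually does, here is a self-contained sketch using the sample boundaries from later in the answer and a mocked-up result ( the input documents are just illustrative ):

```javascript
// Boundaries without the leading 0, as in the full listing
const bounds = [ 5, 10, 20, 50, 100, 500 ];

// Mocked $bucket output: numeric lower boundaries plus the default
let result = [
  { _id: 0, count: 2 },
  { _id: "greater than 500km", count: 1 }
];

// Numeric _id values become "less than <upper bound>km" strings,
// while the "default" string passes through untouched
result = result
  .map(({ _id, ...e }) =>
    ({
      _id: (!isNaN(parseFloat(_id)) && isFinite(_id))
        ? `less than ${bounds[bounds.indexOf(_id)+1]}km`
        : _id,
      ...e
    })
  );

console.log(result.map(r => r._id)); // [ 'less than 5km', 'greater than 500km' ]
```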
Note that whilst a default
is "optional", you will receive a hard error in the case where any data falls outside of the boundary range. In fact the very specific error returned leads us to the next case.
From what was said above you might have realized that the "macro translation" from the $bucket
pipeline stage actually becomes a $group
stage, and one that specifically applies the $switch
operator as its argument to the _id
field for grouping. Again the $switch
operator was introduced with MongoDB 3.4.
Essentially this is really a manual construction of what was shown above using $bucket
, with a little fine tuning on the output of the _id
fields and being a little less terse with the expressions that are produced by the former. In fact you can use the "explain" output of an aggregation pipeline to see something "similar" to the following listing, but using the defined pipeline stage above:
{
"$group": {
"_id": {
"$switch": {
"branches": [
{
"case": {
"$and": [
{
"$lt": [
"$distance",
5
]
},
{
"$gte": [
"$distance",
0
]
}
]
},
"then": "less than 5km"
},
{
"case": {
"$and": [
{
"$lt": [
"$distance",
10
]
}
]
},
"then": "less than 10km"
},
{
"case": {
"$and": [
{
"$lt": [
"$distance",
20
]
}
]
},
"then": "less than 20km"
},
{
"case": {
"$and": [
{
"$lt": [
"$distance",
50
]
}
]
},
"then": "less than 50km"
},
{
"case": {
"$and": [
{
"$lt": [
"$distance",
100
]
}
]
},
"then": "less than 100km"
},
{
"case": {
"$and": [
{
"$lt": [
"$distance",
500
]
}
]
},
"then": "less than 500km"
}
],
"default": "greater than 500km"
}
},
"count": {
"$sum": 1
},
"docs": {
"$push": "$$ROOT"
}
}
}
In fact aside from the clearer "labelling", the only actual difference is that $bucket
uses the $gte
expression along with the $lt
on every single case
. This is not necessary due to how $switch
actually works, with the logical conditions "falling through" just like they would in the common language counterpart usage of a switch
logic block.
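That "fall through" behaviour can be sketched in plain JavaScript -- branches are tested in order and the first match wins, so each branch only needs its upper bound:

```javascript
// Plain-JavaScript analogue of how $switch evaluates its branches.
const bounds = [ 5, 10, 20, 50, 100, 500 ];

function label(distance) {
  for (const bound of bounds) {
    // First matching case wins; the lower bound is implied by
    // every earlier ( smaller ) case having already failed.
    if (distance < bound) return `less than ${bound}km`;
  }
  return 'greater than 500km'; // the "default" branch
}

console.log(label(3));    // less than 5km
console.log(label(300));  // less than 500km
console.log(label(900));  // greater than 500km
```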
This is really more about a matter of personal preference to whether you are happier defining the output "strings" for the _id
within the case
statements or if you are okay with post processing values in order to reformat things like that.
Either way, these basically return the same output ( except there is a defined order to $bucket
results ) as does our third option.
As noted all the above is essentially based around the $switch
operator, but just like its counterpart in various programming language implementations, a "switch statement" is really just a cleaner and more convenient way of writing if .. then .. else if ...
and so on. MongoDB has had an if .. then .. else
expression right back to MongoDB 2.2 with $cond
:
{
"$group": {
"_id": {
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
5
]
},
{
"$gte": [
"$distance",
0
]
}
]
},
"less than 5km",
{
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
10
]
}
]
},
"less than 10km",
{
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
20
]
}
]
},
"less than 20km",
{
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
50
]
}
]
},
"less than 50km",
{
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
100
]
}
]
},
"less than 100km",
{
"$cond": [
{
"$and": [
{
"$lt": [
"$distance",
500
]
}
]
},
"less than 500km",
"greater than 500km"
]
}
]
}
]
}
]
}
]
}
]
},
"count": {
"$sum": 1
},
"docs": {
"$push": {
"_id": "$_id",
"location_point": "$location_point",
"distance": "$distance"
}
}
}
}
Again it's really all just the same, with the main difference being that instead of a "clean array" of options to process as "cases", what you have is a nested set of conditions where the else
just contains another $cond
, right up until the end of the "boundaries" is reached, at which point the else
contains just the default
value.
Since we are also at least "pretending" that we are going back as far as MongoDB 2.4 ( which is the version that introduced $geoNear
 ), other things like $$ROOT
would not be available in that version, so instead you would simply name all the field expressions of the document in order to add that content with a $push
.
All of this really should come down to the fact that the "grouping" is actually done with the $bucket
stage, and that it's probably what you would use unless you wanted some customization of the output or your MongoDB version did not support it ( though you probably should not be running any MongoDB under 3.4 at this time of writing ).
Of course any other form is longer in the required syntax, but really just the same array of arguments can be applied to essentially generate and run either of the shown forms above.
An example listing ( for NodeJS ) follows which demonstrates that it's really just a simple process to generate everything here from a simple array of bounds
for the grouping, along with a few defined options which can be re-used both within the pipeline operations and in any client pre or post processing, whether for generation of the pipeline instructions or for manipulation of the returned results into a "prettier" output format.
const { Schema } = mongoose = require('mongoose');
const uri = 'mongodb://localhost:27017/test',
options = { useNewUrlParser: true };
mongoose.set('useFindAndModify', false);
mongoose.set('useCreateIndex', true);
mongoose.set('debug', true);
const geoSchema = new Schema({
location_point: {
type: { type: String, enum: ["Point"], default: "Point" },
coordinates: [Number, Number]
}
});
geoSchema.index({ "location_point": "2dsphere" },{ background: false });
const GeoModel = mongoose.model('GeoModel', geoSchema, 'geojunk');
const [{ location_point: near }] = data = [
[ -95.712891, 37.09024 ],
[ -95.712893, 37.09024 ],
[ -85.712883, 37.09024 ]
].map(coordinates => ({ location_point: { type: 'Point', coordinates } }));
const log = data => console.log(JSON.stringify(data, undefined, 2));
(async function() {
try {
const conn = await mongoose.connect(uri, options);
// Clean data
await Promise.all(
Object.entries(conn.models).map(([k,m]) => m.deleteMany())
);
// Insert data
await GeoModel.insertMany(data);
const bounds = [ 5, 10, 20, 50, 100, 500 ];
const distanceField = "distance";
// Run three sample cases
for ( let test of [0,1,2] ) {
let pipeline = [
{ "$geoNear": {
near,
"spherical": true,
distanceField,
"distanceMultiplier": 0.001
}},
(() => {
// Standard accumulators
const output = {
"count": { "$sum": 1 },
"docs": { "$push": "$$ROOT" }
};
switch (test) {
case 0:
log("Using $bucket");
return (
{ "$bucket": {
"groupBy": `$${distanceField}`,
"boundaries": [ 0, ...bounds ],
"default": `greater than ${[...bounds].pop()}km`,
output
}}
);
case 1:
log("Manually using $switch");
let branches = bounds.map((bound,i) =>
({
'case': {
'$and': [
{ '$lt': [ `$${distanceField}`, bound ] },
...((i === 0) ? [{ '$gte': [ `$${distanceField}`, 0 ] }]: [])
]
},
'then': `less than ${bound}km`
})
);
return (
{ "$group": {
"_id": {
"$switch": {
branches,
"default": `greater than ${[...bounds].pop()}km`
}
},
...output
}}
);
case 2:
log("Legacy using $cond");
// Build the nested $cond from the innermost "else" outwards,
// starting with the default so that every bound ( including the
// last ) gets its own "less than" case:
let _id = `greater than ${[...bounds].pop()}km`;
for (let i = bounds.length - 1; i >= 0; i--) {
  _id = {
    '$cond': [
      { '$and': [
        { '$lt': [ `$${distanceField}`, bounds[i] ] },
        ...((i === 0) ? [{ '$gte': [ `$${distanceField}`, 0 ] }] : [])
      ]},
      `less than ${bounds[i]}km`,
      _id
    ]
  };
}
// Older MongoDB may require each field instead of $$ROOT
output.docs.$push =
["_id", "location_point", distanceField]
.reduce((o,e) => ({ ...o, [e]: `$${e}` }),{});
return ({ "$group": { _id, ...output } });
}
})()
];
let result = await GeoModel.aggregate(pipeline);
// Text based _id for test: 0 with $bucket
if ( test === 0 )
result = result
.map(({ _id, ...e }) =>
({
_id: (!isNaN(parseFloat(_id)) && isFinite(_id))
? `less than ${bounds[bounds.indexOf(_id)+1]}km`
: _id,
...e
})
);
log({ pipeline, result });
}
} catch (e) {
console.error(e)
} finally {
mongoose.disconnect();
}
})()
And sample output ( and of course ALL listings above are generated from this code ):
Mongoose: geojunk.createIndex({ location_point: '2dsphere' }, { background: false })
"Using $bucket"
{
"result": [
{
"_id": "less than 5km",
"count": 2,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe94",
"location_point": {
"type": "Point",
"coordinates": [
-95.712891,
37.09024
]
},
"__v": 0,
"distance": 0
},
{
"_id": "5ca897dd2efdc41b79d5fe95",
"location_point": {
"type": "Point",
"coordinates": [
-95.712893,
37.09024
]
},
"__v": 0,
"distance": 0.00017759511720976155
}
]
},
{
"_id": "greater than 500km",
"count": 1,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe96",
"location_point": {
"type": "Point",
"coordinates": [
-85.712883,
37.09024
]
},
"__v": 0,
"distance": 887.5656539981669
}
]
}
]
}
"Manually using $switch"
{
"result": [
{
"_id": "greater than 500km",
"count": 1,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe96",
"location_point": {
"type": "Point",
"coordinates": [
-85.712883,
37.09024
]
},
"__v": 0,
"distance": 887.5656539981669
}
]
},
{
"_id": "less than 5km",
"count": 2,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe94",
"location_point": {
"type": "Point",
"coordinates": [
-95.712891,
37.09024
]
},
"__v": 0,
"distance": 0
},
{
"_id": "5ca897dd2efdc41b79d5fe95",
"location_point": {
"type": "Point",
"coordinates": [
-95.712893,
37.09024
]
},
"__v": 0,
"distance": 0.00017759511720976155
}
]
}
]
}
"Legacy using $cond"
{
"result": [
{
"_id": "greater than 500km",
"count": 1,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe96",
"location_point": {
"type": "Point",
"coordinates": [
-85.712883,
37.09024
]
},
"distance": 887.5656539981669
}
]
},
{
"_id": "less than 5km",
"count": 2,
"docs": [
{
"_id": "5ca897dd2efdc41b79d5fe94",
"location_point": {
"type": "Point",
"coordinates": [
-95.712891,
37.09024
]
},
"distance": 0
},
{
"_id": "5ca897dd2efdc41b79d5fe95",
"location_point": {
"type": "Point",
"coordinates": [
-95.712893,
37.09024
]
},
"distance": 0.00017759511720976155
}
]
}
]
}
Upvotes: 8