Reputation: 11
I am trying to calculate the percentage of listings in a MongoDB that contain a specific word grouped by a collection's object.
I have managed to group the count of listings containing the word but not the percentage on the total count of each group's listings.
My collection looks like this:
{
"_id" : "103456",
"metadata" : {
"type" : "Bike",
"brand" : "Siamoto",
"model" : "Siamoto vespa '01 - € 550 EUR (Negotiable)"
}
},
{
"_id" : "103457",
"metadata" : {
"type" : "Bike",
"brand" : "BMW",
"model" : "BMW ADFR '06 - € 5680 EUR"
}
}
I want to project the percentage of ads per metadata.brand that contain the word "Negotiable" in metadata.model.
I have used for the count something like:
db.advertisements.aggregate([
{ $match: { $text: { $search: "Negotiable" } } },
{ $group: { _id: "$metadata.brand", Count: { $sum: 1} } }
])
and it worked but I can't find a workaround for the percentage. Thanks to all
Upvotes: 1
Views: 647
Reputation: 151132
For what you are trying to do, using a $text
search or even a $regex
is the wrong approach. All these can do is return the "matching" documents only from within the collection.
Whist not as flexible as a regular expression ( and sadly there is no aggregation operator equivalent at this time, but there will be in future releases. See SERVER-11947 ) the better option is to use $indexOfCP
in order to match the occurrence of the "string" and then count those against the "total counts" from each grouping:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": [{ "$ne": [{ "$indexOfCP": [ "$metadata.model", "Negotiable" ] }, -1 ] }, 1, 0]
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
And the results:
{ "_id" : "Siamoto", "totalCount" : 1, "matchedCount" : 1, "percentage" : 100 }
{ "_id" : "BMW", "totalCount" : 1, "matchedCount" : 0, "percentage" : 0 }
Note that the $group
is used for the accumulation of both the total documents found within the "brand"
as well as those where the string was matched. The $cond
operator used here is a "ternary" or if/then/else
statement which evaluates a boolean expression and then returns either one value where true
or another where false
. In this case the $indexOfCP
NOT returning the -1
value or "not found".
The "percentage" is actually done in a separate stage, which in this case we use $addFields
to add the "additional field". The operation is basically a $divide
over the two accumulated values from the previous stage. The $cond
is just applied to avoid "divide by 0" errors and the $multiply
is just moving the decimal places into something that looks more like a "percentage". But the basic premise is such calculations which require "totals" to be accumulated first will always be a manipulation in a "later stage".
MongoDB 4.2 (proposed) Preview
FYI, on the current "unfinalized" syntax for $regexFind
from MongoDB 4.2 (proposed, and yet to be finalized if included in that release ) and onwards this would be something like:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": {
"if": {
"$ne": [
{ "$regexFind": {
"input": "$metadata.model",
"regex": /Negotiable/i
}},
null
]
},
"then": 1,
"else": 0
}
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
Again noting strongly that the "current" implementation may be subject to change by the time it is released. This is how it works on the current 4.1.9-17-g0a856820ba
development release.
An alternate approach where either your MongoDB version does not support $indexOfCP
OR you need more flexibility in how you "match the string" is to use mapReduce
for the aggregation instead:
db.advertisements.mapReduce(
function() {
emit(this.metadata.brand, {
totalCount: 1,
matchedCount: (/Negotiable/i.test(this.metadata.model)) ? 1 : 0
});
},
function(key,values) {
var obj = { totalCount: 0, matchedCount: 0 };
values.forEach(value => {
obj.totalCount += value.totalCount;
obj.matchedCount += value.matchedCount;
});
return obj;
},
{
"out": { "inline": 1 },
"finalize": function(key,value) {
value.percentage = (value.matchedCount != 0)
? (value.matchedCount / value.totalCount) * 100
: 0;
return value;
}
}
)
This has a similar result, but in a very "mapReduce" specific way:
{
"_id" : "BMW",
"value" : {
"totalCount" : 1,
"matchedCount" : 0,
"percentage" : 0
}
},
{
"_id" : "Siamoto",
"value" : {
"totalCount" : 1,
"matchedCount" : 1,
"percentage" : 100
}
}
The logic is pretty much the same. We "emit" using the "key" for the "brand"
and then use another ternary to determine whether to count a "match" or not. In this case a regular expression test()
operation, and even using "case insensitive" matching as an example.
The "reducer" part simply accumulates the values that were emitted, and the finalize
function is where the "percentage" is returned by the same division and multiplication process.
The main difference between the two other than basic capabilities is that the mapReduce
cannot do "further things" beyond the accumulation and basic manipulation in the finalize
. The "sorting" demonstrated in the aggregation pipeline cannot be done with mapReduce
without outputting to a separate collection and doing a separate find()
and sort()
on those documents contained.
Either way works, and it just depends on your needs and the capabilities of what you have available. Of course any aggregate()
approach will be much faster than using the JavaScript
evaluation of mapReduce
. So you probably want aggregate()
as your preference where possible.
Upvotes: 1