Antwnis Prappas
Antwnis Prappas

Reputation: 11

How to find percentage of grouping containing a specific word

I am trying to calculate the percentage of listings in a MongoDB that contain a specific word grouped by a collection's object.

I have managed to group the count of listings containing the word but not the percentage on the total count of each group's listings.

My collection looks like this:

{
        "_id" : "103456",
        "metadata" : {
                "type" : "Bike",
                "brand" : "Siamoto",
                "model" : "Siamoto  vespa '01 - € 550 EUR (Negotiable)"
        }
},
{
        "_id" : "103457",
        "metadata" : {
                "type" : "Bike",
                "brand" : "BMW",
                "model" : "BMW ADFR '06 - € 5680 EUR"
        }
}

I want to project the percentage of ads per metadata.brand that contain the word "Negotiable" in metadata.model.

I have used for the count something like:

db.advertisements.aggregate([
  { $match: { $text: { $search: "Negotiable" } } },
  { $group: { _id: "$metadata.brand", Count: { $sum: 1} } }
])

and it worked but I can't find a workaround for the percentage. Thanks to all

Upvotes: 1

Views: 647

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151132

For what you are trying to do, using a $text search or even a $regex is the wrong approach. All these can do is return the "matching" documents only from within the collection.

Using Aggregate to Count String Matches

Whist not as flexible as a regular expression ( and sadly there is no aggregation operator equivalent at this time, but there will be in future releases. See SERVER-11947 ) the better option is to use $indexOfCP in order to match the occurrence of the "string" and then count those against the "total counts" from each grouping:

db.advertisements.aggregate([
  { "$group": {
    "_id": "$metadata.brand",
    "totalCount": { "$sum": 1 },
    "matchedCount": {
      "$sum": {
        "$cond": [{ "$ne": [{ "$indexOfCP": [ "$metadata.model", "Negotiable" ] }, -1 ] }, 1, 0]
      }
    }
  }},
  { "$addFields": {
    "percentage": {
      "$cond": {
        "if": { "$ne": [ "$matchedCount", 0 ] },
        "then": {
          "$multiply": [
            { "$divide": [ "$matchedCount", "$totalCount" ] },
            100
          ]
        },
        "else": 0
      }
    }
  }},
  { "$sort": { "percentage": -1 } }
])

And the results:

{ "_id" : "Siamoto", "totalCount" : 1, "matchedCount" : 1, "percentage" : 100 }
{ "_id" : "BMW", "totalCount" : 1, "matchedCount" : 0, "percentage" : 0 }

Note that the $group is used for the accumulation of both the total documents found within the "brand" as well as those where the string was matched. The $cond operator used here is a "ternary" or if/then/else statement which evaluates a boolean expression and then returns either one value where true or another where false. In this case the $indexOfCP NOT returning the -1 value or "not found".

The "percentage" is actually done in a separate stage, which in this case we use $addFields to add the "additional field". The operation is basically a $divide over the two accumulated values from the previous stage. The $cond is just applied to avoid "divide by 0" errors and the $multiply is just moving the decimal places into something that looks more like a "percentage". But the basic premise is such calculations which require "totals" to be accumulated first will always be a manipulation in a "later stage".

MongoDB 4.2 (proposed) Preview

FYI, on the current "unfinalized" syntax for $regexFind from MongoDB 4.2 (proposed, and yet to be finalized if included in that release ) and onwards this would be something like:

db.advertisements.aggregate([
  { "$group": {
    "_id": "$metadata.brand",
    "totalCount": { "$sum": 1 },
    "matchedCount": {
      "$sum": {
        "$cond": {
          "if": {
            "$ne": [
              { "$regexFind": {
                "input": "$metadata.model",
                "regex": /Negotiable/i
              }},
              null
            ]
          },
          "then": 1,
          "else": 0
        }
      }
    }
  }},
  { "$addFields": {
    "percentage": {
      "$cond": {
        "if": { "$ne": [ "$matchedCount", 0 ] },
        "then": {
          "$multiply": [
            { "$divide": [ "$matchedCount", "$totalCount" ] },
            100
          ]
        },
        "else": 0
      }
    }
  }},
  { "$sort": { "percentage": -1 } }
])

Again noting strongly that the "current" implementation may be subject to change by the time it is released. This is how it works on the current 4.1.9-17-g0a856820ba development release.

Using MapReduce

An alternate approach where either your MongoDB version does not support $indexOfCP OR you need more flexibility in how you "match the string" is to use mapReduce for the aggregation instead:

db.advertisements.mapReduce(
  function() {
    emit(this.metadata.brand, {
      totalCount: 1,
      matchedCount: (/Negotiable/i.test(this.metadata.model)) ? 1 : 0
    });
  },
  function(key,values) {
    var obj = { totalCount: 0, matchedCount: 0 };
    values.forEach(value => {
      obj.totalCount += value.totalCount;
      obj.matchedCount += value.matchedCount;
    });
    return obj;
  },
  {
    "out": { "inline": 1 },
    "finalize": function(key,value) {
      value.percentage = (value.matchedCount != 0)
        ? (value.matchedCount / value.totalCount) * 100
        : 0;
      return value;
    }
  }
)

This has a similar result, but in a very "mapReduce" specific way:

            {
                    "_id" : "BMW",
                    "value" : {
                            "totalCount" : 1,
                            "matchedCount" : 0,
                            "percentage" : 0
                    }
            },
            {
                    "_id" : "Siamoto",
                    "value" : {
                            "totalCount" : 1,
                            "matchedCount" : 1,
                            "percentage" : 100
                    }
            }

The logic is pretty much the same. We "emit" using the "key" for the "brand" and then use another ternary to determine whether to count a "match" or not. In this case a regular expression test() operation, and even using "case insensitive" matching as an example.

The "reducer" part simply accumulates the values that were emitted, and the finalize function is where the "percentage" is returned by the same division and multiplication process.

The main difference between the two other than basic capabilities is that the mapReduce cannot do "further things" beyond the accumulation and basic manipulation in the finalize. The "sorting" demonstrated in the aggregation pipeline cannot be done with mapReduce without outputting to a separate collection and doing a separate find() and sort() on those documents contained.


Either way works, and it just depends on your needs and the capabilities of what you have available. Of course any aggregate() approach will be much faster than using the JavaScript evaluation of mapReduce. So you probably want aggregate() as your preference where possible.

Upvotes: 1

Related Questions