MongoDB count millions of documents in related collection

So, I'm quite stuck, and here goes my first question on Stack Overflow: after years of lurking I definitely need some good advice. I have two document types:

Article

Approximately 15k articles today, but increasing rapidly as we onboard customers. We don't want a limit here.

{ 
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "title" : "A neat title"
}

Activity

Approximately 1k activities per Article, written at each marketing-relevant stage of the user's navigation (for instance: viewing or sharing an Article). Driving more traffic to the website will increase this 1:1000 Article-to-Activity ratio.

{ 
    "_id" : ObjectId("5bbdae8afd529871473c1111"), 
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "what" : "view"
}
{ 
    "_id" : ObjectId("5bbdae8afd529871473c2222"), 
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "what" : "share"
}

My goal is to aggregate Articles counting the related Activities:

{ 
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "title" : "A neat title",
    "statistics" : {
        'view':1,
        'share':1,
     }
}

The indexes on Activity.article and Activity.what are all set.
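For reference, they were created along these lines; a single compound index covering both fields might also serve the equality filters used below:

db.activity.createIndex({ "article": 1 })
db.activity.createIndex({ "what": 1 })
// a possible alternative: db.activity.createIndex({ "article": 1, "what": 1 })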

On a small dataset I can easily achieve my goal with this aggregation:

db.article.aggregate([
{ $match: { 
    // ...some irrelevant match
}},
{ $lookup: {
     from: "activity",
     localField: "_id",
     foreignField: "article",
     as: "activities"
}},
{ $project: {
    data: '$$ROOT',
    // keep only the "view" activities...
    views: {$filter: {
        input: '$activities',
        as: 'view',
        cond: {$eq: ['$$view.what', 'view']}
    }},
    // ...and only the "share" activities
    shares: {$filter: {
        input: '$activities',
        as: 'share',
        cond: {$eq: ['$$share.what', 'share']}
    }}
}},
{ $addFields: {
    // the counts I am actually after
    'data.statistics.views': { $size: '$views' },
    'data.statistics.shares': { $size: '$shares' }
}},
{ $project: { 
    // drop the bulky arrays before returning
    'data.activities': 0,
    'views': 0,
    'shares': 0
}},
{ $replaceRoot: { newRoot: '$data' } },
])

This gives me exactly what I want, as long as the $lookup result doesn't cross the 16MB limit. With millions of Activities the aggregation fails, even though the documentation states:

Aggregation Pipeline Limits: "The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size."

What I have already tried:

  1. Adding allowDiskUse / fails; it doesn't seem to write anything, since I do not see the _tmp folder in my data directory
  2. Adding allowDiskUse + cursor / fails as well
  3. Saving the result in a temporary collection with { $out: "result" } / fails
  4. Changing the aggregation to a $lookup + $unwind coalescence / it works, but for 1.5 million Activities the result takes 10 seconds, because after the $unwind each stage of the pipeline (i.e. the $group that rebuilds the document) cannot use the existing indexes
  5. Changing the $lookup to use its internal pipeline form (sketched below) / it works, but for 0.2 million Activities it takes 1.5 MINUTES (I stopped the test at 1.5 million) and results are then returned in 6 seconds. This is probably my best bet...
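For clarity, attempt #5 looks roughly like this (the outer $match is elided as above, and unpacking the grouped counts could surely be written in other ways):

db.article.aggregate([
    { $match: {
        // ...same match as above
    }},
    { $lookup: {
        from: "activity",
        let: { articleId: "$_id" },
        pipeline: [
            // counting happens on the activity side, so only one tiny
            // document per "what" value travels back through the $lookup
            { $match: { $expr: { $eq: ["$article", "$$articleId"] } } },
            { $group: { _id: "$what", count: { $sum: 1 } } }
        ],
        as: "counts"
    }},
    { $addFields: {
        // unpack the grouped counts into the statistics object
        'statistics.views': { $sum: { $map: {
            input: '$counts',
            as: 'c',
            in: { $cond: [ { $eq: ['$$c._id', 'view'] }, '$$c.count', 0 ] }
        }}},
        'statistics.shares': { $sum: { $map: {
            input: '$counts',
            as: 'c',
            in: { $cond: [ { $eq: ['$$c._id', 'share'] }, '$$c.count', 0 ] }
        }}}
    }},
    { $project: { counts: 0 } }
])

This never hits the 16MB limit, since the joined array holds at most one small document per activity type; the slowness seems to come from the $expr match inside the sub-pipeline, which, as far as I can tell, cannot use my Activity.article index.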

I even tried something like this:

db.article.aggregate([
    { $match: { 
        ...
    }},
    { $addFields: {
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what" : "view" }).count()
    }}
])

That works wonderfully (0.008 secs per Article). The problem is that I cannot "variabilize" (i.e. parameterize) that ObjectId:

db.article.aggregate([
    { $match: { 
        ...
    }},
    { $addFields: {

        // returns the correct count
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what" : "view" }).count(),

        // returns the correct query as a string
        'statistics.querystring': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in:   { 'query': { $concat: [ "db.activity.find( { 'article': ObjectId('", { $toString: "$$articleid" }, "'), 'what' : '", "$$whatvalue", "' } ).count()" ] } }
        }},

        // returns 0
        'statistics.variablequery': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in:   db.activity.find( { "article": '$$articleid', "what" : "$$whatvalue" } ).count()
        }}

    }}
])

I am open to any solution; changing my collections is possible. I have, however, ruled out incrementing a counter inside the Article when writing an Activity, because I need to filter by date (e.g. give me all the shares of the last week), as sketched below.
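To make that date requirement concrete, here is the kind of count I need. Since Activity has no explicit date field in the schema above, this sketch derives one from the ObjectId creation time ($toDate needs MongoDB 4.0+; the ObjectId is a placeholder):

// count this Article's shares from the last 7 days
db.activity.aggregate([
    { $match: { "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what": "share" } },
    { $addFields: { created: { $toDate: "$_id" } } },
    { $match: { created: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) } } },
    { $count: "sharesLastWeek" }
])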

Upvotes: 2

Views: 1309

Answers (1)

Nic Cottrell

Reputation: 9695

How big are the Activity documents going to be? Since they seem small, I would keep the activities as an array inside the Article document. The document limit is 16MB, so that should be fine, and you avoid the _id and duplicated article-id fields on disk, making the data on disk much smaller. Remember that MongoDB isn't your traditional SQL database: embedded fields and documents are your friend.

If activity is going to be unbounded (i.e. it can grow forever), then I suggest a bucketing approach, where you have one Activity document per Article per day, something like:

{ 
    "_id" : {
       "article" : ObjectId("5bbdae8afd529871473c2222"),
       "when": "2018-12-27"
    },
    "activities" : [
       {"what": "view", "when": "12:01"},
       {"what": "share", "when": "13:16"}
    ]
}

You could store full timestamps or ISODates in the "when" fields, but this approach is more readable and probably more compact on disk.
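As a sketch, recording one event then becomes a single upsert that pushes into the day's bucket (the ids and timestamps here are placeholders):

// create the Article's bucket for the day if needed, then append the event
db.activity.updateOne(
    { "_id": { "article": ObjectId("5bbdae8afd529871473c2222"), "when": "2018-12-27" } },
    { "$push": { "activities": { "what": "view", "when": "12:01" } } },
    { "upsert": true }
)

Your "shares of the last week" query then becomes a range match on "_id.when" (the YYYY-MM-DD strings compare correctly) plus a count of the matching array entries.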

Upvotes: 2
