MongoDB count millions of documents in related collection

So, I'm quite stuck, and here goes my first question on Stack Overflow: after years of lurking I definitely need some good advice. I have two document types:

Article

Approximately 15k articles today, but increasing rapidly as we onboard customers. We don't want a limit here.

{ 
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "title" : "A neat title"
}

Activity

Approximately 1k activities per Article, written at each marketing-relevant stage of the user's navigation (for instance: viewing or sharing an Article). Driving more traffic to the website will increase this 1:1000 Article-to-Activity ratio.

{ 
    "_id" : ObjectId("5bbdae8afd529871473c1111"), 
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "what" : "view"
}
{ 
    "_id" : ObjectId("5bbdae8afd529871473c2222"), 
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "what" : "share"
}

My goal is to aggregate Articles counting the related Activities:

{ 
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"), 
    "title" : "A neat title",
    "statistics" : {
        'view':1,
        'share':1,
     }
}

The indexes on Activity.article and Activity.what are all set.
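For reference, they were created along these lines; a single compound index covering both fields might also serve the equality filters used below:

db.activity.createIndex({ "article": 1 })
db.activity.createIndex({ "what": 1 })
// a possible alternative: db.activity.createIndex({ "article": 1, "what": 1 })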

On a small dataset I can easily achieve my goal with this aggregation:

db.article.aggregate([
{ $match: { 
    // ...some irrelevant match
}},
{ $lookup: {
     from: "activity",
     localField: "_id",
     foreignField: "article",
     as: "activities"
}},
{ $project: {
    data: '$$ROOT',
    // keep only the "view" activities...
    views: {$filter: {
        input: '$activities',
        as: 'view',
        cond: {$eq: ['$$view.what', 'view']}
    }},
    // ...and only the "share" activities
    shares: {$filter: {
        input: '$activities',
        as: 'share',
        cond: {$eq: ['$$share.what', 'share']}
    }}
}},
{ $addFields: {
    // the counts I am actually after
    'data.statistics.views': { $size: '$views' },
    'data.statistics.shares': { $size: '$shares' }
}},
{ $project: { 
    // drop the bulky arrays before returning
    'data.activities': 0,
    'views': 0,
    'shares': 0
}},
{ $replaceRoot: { newRoot: '$data' } },
])

This gives me exactly what I want, as long as the $lookup result doesn't cross the 16MB limit. With millions of Activities the aggregation fails, even though the documentation states:

Aggregation Pipeline Limits: "The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size."

What I have already tried:

  1. Adding allowDiskUse / fails; it doesn't seem to write anything, since I do not see the _tmp folder in my data directory
  2. Adding allowDiskUse + cursor / fails as well
  3. Saving the result in a temporary collection with { $out: "result" } / fails
  4. Changing the aggregation to a $lookup + $unwind coalescence / it works, but for 1.5 million Activities the result takes 10 seconds, because after the $unwind each stage of the pipeline (i.e. the $group that rebuilds the document) cannot use the existing indexes
  5. Changing the $lookup to use its internal pipeline form (sketched below) / it works, but for 0.2 million Activities it takes 1.5 MINUTES (I stopped the test at 1.5 million) and results are then returned in 6 seconds. This is probably my best bet...
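For clarity, attempt #5 looks roughly like this (the outer $match is elided as above, and unpacking the grouped counts could surely be written in other ways):

db.article.aggregate([
    { $match: {
        // ...same match as above
    }},
    { $lookup: {
        from: "activity",
        let: { articleId: "$_id" },
        pipeline: [
            // counting happens on the activity side, so only one tiny
            // document per "what" value travels back through the $lookup
            { $match: { $expr: { $eq: ["$article", "$$articleId"] } } },
            { $group: { _id: "$what", count: { $sum: 1 } } }
        ],
        as: "counts"
    }},
    { $addFields: {
        // unpack the grouped counts into the statistics object
        'statistics.views': { $sum: { $map: {
            input: '$counts',
            as: 'c',
            in: { $cond: [ { $eq: ['$$c._id', 'view'] }, '$$c.count', 0 ] }
        }}},
        'statistics.shares': { $sum: { $map: {
            input: '$counts',
            as: 'c',
            in: { $cond: [ { $eq: ['$$c._id', 'share'] }, '$$c.count', 0 ] }
        }}}
    }},
    { $project: { counts: 0 } }
])

This never hits the 16MB limit, since the joined array holds at most one small document per activity type; the slowness seems to come from the $expr match inside the sub-pipeline, which, as far as I can tell, cannot use my Activity.article index.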

I even tried something like this:

db.article.aggregate([
    { $match: { 
        ...
    }},
    { $addFields: {
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what" : "view" }).count()
    }}
])

That works wonderfully (0.008 secs per Article). The problem is that I cannot "variabilize" (i.e. parameterize) that ObjectId:

db.article.aggregate([
    { $match: { 
        ...
    }},
    { $addFields: {

        // returns the correct count
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what" : "view" }).count(),

        // returns the correct query as a string
        'statistics.querystring': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in:   { 'query': { $concat: [ "db.activity.find( { 'article': ObjectId('", { $toString: "$$articleid" }, "'), 'what' : '", "$$whatvalue", "' } ).count()" ] } }
        }},

        // returns 0
        'statistics.variablequery': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in:   db.activity.find( { "article": '$$articleid', "what" : "$$whatvalue" } ).count()
        }}

    }}
])

I am open to any solution; changing my collections is possible. I have, however, ruled out incrementing a counter inside the Article when writing an Activity, because I need to filter by date (e.g. give me all the shares of the last week), as sketched below.
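To make that date requirement concrete, here is the kind of count I need. Since Activity has no explicit date field in the schema above, this sketch derives one from the ObjectId creation time ($toDate needs MongoDB 4.0+; the ObjectId is a placeholder):

// count this Article's shares from the last 7 days
db.activity.aggregate([
    { $match: { "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what": "share" } },
    { $addFields: { created: { $toDate: "$_id" } } },
    { $match: { created: { $gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) } } },
    { $count: "sharesLastWeek" }
])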

Upvotes: 2

Views: 1309

Answers (1)

Nic Cottrell

Reputation: 9695

How big are the Activity documents going to be? Since they seem small, I would keep the activities as an array inside the Article document. The document limit is 16MB, so that should be fine, and you avoid the _id and duplicated article-id fields on disk, making the data on disk much smaller. Remember that MongoDB isn't your traditional SQL database: embedded fields and documents are your friend.

If activity is going to be unbounded (i.e. it can grow forever), then I suggest a bucketing approach, where you have one Activity document per Article per day, something like:

{ 
    "_id" : {
       "article" : ObjectId("5bbdae8afd529871473c2222"),
       "when": "2018-12-27"
    },
    "activities" : [
       {"what": "view", "when": "12:01"},
       {"what": "share", "when": "13:16"}
    ]
}

You could store full timestamps or ISODates in the "when" fields, but this approach is more readable and probably more compact on disk.
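As a sketch, recording one event then becomes a single upsert that pushes into the day's bucket (the ids and timestamps here are placeholders):

// create the Article's bucket for the day if needed, then append the event
db.activity.updateOne(
    { "_id": { "article": ObjectId("5bbdae8afd529871473c2222"), "when": "2018-12-27" } },
    { "$push": { "activities": { "what": "view", "when": "12:01" } } },
    { "upsert": true }
)

Your "shares of the last week" query then becomes a range match on "_id.when" (the YYYY-MM-DD strings compare correctly) plus a count of the matching array entries.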

Upvotes: 2
