So, I'm quite stuck, and here goes my first question on Stack Overflow; after years of lurking I definitely need some good advice. I have two document types:
Article
Approximately 15k articles today, but the number is growing rapidly as we onboard customers. We don't want a limit here.
{
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"),
    "title" : "A neat title"
}
Activity
Approximately 1k activities per Article, written at each marketing-relevant stage of the user's navigation (for instance: viewing or sharing an Article). Driving more traffic to the website will push the current 1:1000 Article-to-Activity ratio even higher.
{
    "_id" : ObjectId("5bbdae8afd529871473c1111"),
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"),
    "what" : "view"
}
{
    "_id" : ObjectId("5bbdae8afd529871473c2222"),
    "article" : ObjectId("5bd054d8fd5298d07ddc293a"),
    "what" : "share"
}
My goal is to aggregate Articles counting the related Activities:
{
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"),
    "title" : "A neat title",
    "statistics" : {
        "view" : 1,
        "share" : 1
    }
}
The indexes on Activity.article and Activity.what are already in place.
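For the record, roughly like this (a single compound index covering both fields would be an alternative):

// Current indexes; a compound { article: 1, what: 1 } index would be
// an alternative that serves the combined filter from one index.
db.activity.createIndex({ article: 1 })
db.activity.createIndex({ what: 1 })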
On a small dataset I can easily achieve my goal with this aggregation:
db.article.aggregate([
    { $match: {
        // ...some match, not relevant here
    }},
    { $lookup: {
        from: "activity",
        localField: "_id",
        foreignField: "article",
        as: "activities"
    }},
    { $project: {
        data: '$$ROOT',
        views: { $filter: {
            input: '$activities',
            as: 'activity',
            cond: { $eq: ['$$activity.what', 'view'] }
        }},
        shares: { $filter: {
            input: '$activities',
            as: 'activity',
            cond: { $eq: ['$$activity.what', 'share'] }
        }}
    }},
    { $addFields: {
        'data.statistics.views': { $size: '$views' },
        'data.statistics.shares': { $size: '$shares' }
    }},
    { $project: {
        'data.activities': 0,
        'views': 0,
        'shares': 0
    }},
    { $replaceRoot: { newRoot: '$data' } }
])
This gives me exactly what I want, as long as the $lookup doesn't cross the 16MB limit. If I have millions of Activities, the aggregation fails, even though the documentation on Aggregation Pipeline Limits states: "The limit only applies to the returned documents; during the pipeline processing, the documents may exceed this size".
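My reading is that the $lookup output itself is the problem: each Article has to hold all of its Activities in a single document before I get to count them, and that single document crosses the 16MB per-document BSON limit long before anything is returned. One direction I'm experimenting with (assuming MongoDB 3.6+ for the let/pipeline form of $lookup, and 3.4.4+ for $arrayToObject; not benchmarked at scale) is to do the counting inside the join, so the full array never materializes:

db.article.aggregate([
    { $match: {
        // ...
    }},
    { $lookup: {
        from: "activity",
        let: { articleId: "$_id" },
        pipeline: [
            // Count activities per type inside the join, so only a few
            // small { _id, count } documents come back per Article.
            { $match: { $expr: { $eq: ["$article", "$$articleId"] } } },
            { $group: { _id: "$what", count: { $sum: 1 } } }
        ],
        as: "counts"
    }},
    { $addFields: {
        // Reshape [{_id: "view", count: 1}, ...] into {view: 1, ...}.
        statistics: { $arrayToObject: { $map: {
            input: "$counts",
            as: "c",
            in: { k: "$$c._id", v: "$$c.count" }
        }}}
    }},
    { $project: { counts: 0 } }
])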
What I have already tried:
I even tried something like this:
db.article.aggregate([
    { $match: {
        // ...
    }},
    { $addFields: {
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what": "view" }).count()
    }}
])
That works wonderfully (0.008 secs per Article). The problem is that I cannot parameterize ("variabilize") that ObjectId:
db.article.aggregate([
    { $match: {
        // ...
    }},
    { $addFields: {
        'statistics.views': db.activity.find({ "article": ObjectId('5bd054d8fd5298d07ddc293a'), "what": "view" }).count(),
        // ^ returns the correct count
        'statistics.querystring': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in: { 'query': { $concat: [ "db.activity.find( { 'article': ObjectId('", { $toString: "$$articleid" }, "'), 'what' : '", "$$whatvalue", "' } ).count()" ] } }
        }},
        // ^ returns the correct query as a string
        'statistics.variablequery': { $let: {
            vars: { articleid: "$_id", whatvalue: 'view' },
            in: db.activity.find({ "article": '$$articleid', "what": "$$whatvalue" }).count()
        }}
        // ^ returns 0: the shell evaluates this find() once, client-side,
        // before the pipeline is even sent, so the server compares "article"
        // against the literal string '$$articleid' and matches nothing
    }}
])
I am open to any solution; changing my collections is possible. I have, however, ruled out incrementing a counter inside the Article whenever an Activity is written, because I need to filter by date (e.g.: give me all the shares of the last week).
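To make that date requirement concrete, a count like the following is what needs to stay possible (hypothetical: it assumes each Activity gains a "when" timestamp field, which it doesn't have today):

// Hypothetical: assumes an extra "when" timestamp on each Activity.
var oneWeekAgo = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
db.activity.count({
    "article": ObjectId("5bd054d8fd5298d07ddc293a"),
    "what": "share",
    "when": { $gte: oneWeekAgo }
})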
How big are the Activity documents going to be? Since they seem small, I would keep each activity as an element of an array inside the Article document. The document limit is 16 MB, so that should be fine, and you avoid storing the _id and the duplicated article id for every activity, making the data on disk much smaller. Remember that MongoDB isn't your traditional SQL database: embedded fields and documents are your friends.
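A sketch of what an Article could look like with embedded activities (the ISODate values here are made up for illustration):

{
    "_id" : ObjectId("5bd054d8fd5298d07ddc293a"),
    "title" : "A neat title",
    "activities" : [
        { "what" : "view",  "when" : ISODate("2018-12-27T12:01:00Z") },
        { "what" : "share", "when" : ISODate("2018-12-27T13:16:00Z") }
    ]
}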
If activity is going to be unbounded (i.e. it can grow forever), then I suggest a bucketing approach instead, where you have one Activity document per Article per day, something like:
{
    "_id" : {
        "article" : ObjectId("5bbdae8afd529871473c2222"),
        "when" : "2018-12-27"
    },
    "activities" : [
        { "what" : "view",  "when" : "12:01" },
        { "what" : "share", "when" : "13:16" }
    ]
}
You could store full timestamps or ISODates in the "when" fields, but this approach is more readable and probably more compact on disk.
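Writing into a bucket is then a single upsert per event, and the day-based key makes date ranges easy to express. A sketch (field names as above; updateOne with upsert assumes MongoDB 3.2+):

// Append one activity to today's bucket, creating it if necessary.
db.activity.updateOne(
    { "_id": { "article": ObjectId("5bbdae8afd529871473c2222"), "when": "2018-12-27" } },
    { $push: { "activities": { "what": "view", "when": "12:01" } } },
    { upsert: true }
)

// "All the shares of the last week" becomes a range match on the
// bucket key plus an unwind of the matching buckets.
db.activity.aggregate([
    { $match: {
        "_id.article": ObjectId("5bbdae8afd529871473c2222"),
        "_id.when": { $gte: "2018-12-20" }
    }},
    { $unwind: "$activities" },
    { $match: { "activities.what": "share" } },
    { $count: "shares" }
])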