Reputation: 141
I have 2 mongo collections:
companies: Each record is a company with many fields (city, country, etc) —> 100k rows
{company_id:1, country:"USA", city:"New York",...}
{company_id:2, country:"Spain", city:"Valencia",... }
{company_id:3, country:"France", city:"Paris",... }
scores: There are blocks of dates, and every block has one document per company with its company_id + score, example —> 100k rows in each block
{date: 2016-05-29, company_id:1, score:90}
{date: 2016-05-29, company_id:2, score:87}
{date: 2016-05-29, company_id:3, score:75}
...
{date: 2016-05-22, company_id:1, score:88}
{date: 2016-05-22, company_id:2, score:87}
{date: 2016-05-22, company_id:3, score:76}
...
{date: 2016-05-15, company_id:1, score:91}
{date: 2016-05-15, company_id:2, score:82}
{date: 2016-05-15, company_id:3, score:73}
...
Objective:
I want to retrieve a list of companies that can be filtered by some fields (country, city, ...), together with each company's newest score (on 2016-05-29), ordered by score descending
That is: filter in one collection, filter + order in the other collection
Note: There's an index on scores.date, and the highest date (2016-05-29 in this example) can be located/pre-calculated quickly and cheaply.
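For reference, a minimal shell sketch of how that newest date could be pre-calculated (just a sketch, assuming nothing beyond the existing scores.date index):

// Grab the most recent score date; the descending sort can be served by the scores.date index.
var latestDate = db.scores.find({}, {date: 1, _id: 0}).sort({date: -1}).limit(1).toArray()[0].date;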
Attempts:
I've been trying an aggregate query using $lookup. When the filter is fully specified (so the number of matching companies is small), the query is fast. The query is as below:
db.companies.aggregate([
{$match: {"status": "running", "country": "USA", "city": "San Francisco",
"categories": { $in: ["Software"]}, dummy: false}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$project: {_id: "$_id",
"company_id": "$company_id",
"company_name": "$company_name",
"status": "$status",
"city": "$city",
"country": "$country",
"categories": "$categories",
"dummy": "$dummy",
"score": "$scores.score",
"date": "$scores.date"}},
{$match: {"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort: {"score":-1}}
],{allowDiskUse: true})
But when the filter is small or empty (so many more companies match), the $sort stage takes several seconds:
db.companies.aggregate([
{$match: {"status": "running"}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$project: {_id: "$_id",
"company_id": "$company_id",
"company_name": "$company_name",
"status": "$status",
"city": "$city",
"country": "$country",
"categories": "$categories",
"dummy": "$dummy",
"score": "$scores.score",
"date": "$scores.date"}},
{$match: {"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort: {"score":-1}}
],{allowDiskUse: true})
Probably because of the number of companies the filter finds: 59 rows are easier to order than 89k.
> db.companies.count({"status": "running", "country": "USA", "city": "San Francisco", "categories": { $in: ["Software"]}, dummy: false})
59
> db.companies.count({"status": "running"})
89043
I've tried a different approach: aggregating on scores, filtering by date, and sorting by score (the index on date+score is very useful here). Everything is very fast until the last $match, where I filter on company attributes:
db.scores.aggregate([
{$match:{"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort:{"score":-1}},
{$lookup:{from: "companies", localField: "company_id", foreignField: "company_id", as:"companies"}},
{$unwind:"$companies"},
{$project: {_id: "$companies._id",
"company_id": "$companies.company_id",
"company_name": "$companies.company_name",
"status": "$companies.status",
"city": "$companies.city",
"country": "$companies.country",
"categories": "$companies.categories",
"dummy": "$companies.dummy"}},
"score": "$score",
"date": "$date"
{$match:{"status": "running", "country":"USA", "city": "San Francisco",
"categories": { $in: ["Software"]}, dummy: false}}
],{allowDiskUse: true})
Using this approach, the large (very selective) filter from the previous example is very slow, and the small filter (just {"status": "running"}) is faster.
Any way to join both collections, filter in both of them and order by one field?
Upvotes: 2
Views: 1540
Reputation: 11132
As far as I can see, there are only a couple of scores per company (not many) on various dates. So it's sort of a 1:few relationship.
So the first thing that comes to mind: why not put the scores in the companies collection?
{ company_id:1,
country:"USA",
city:"New York",
...
scores: [
{date: 2016-05-29, score:90},
...
]
}
That way the structure is more aligned with your access pattern and you could skip the lookup part entirely. Meaning, you could define a proper index and use find() instead of aggregation.
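For illustration, a rough sketch of what that could look like with the embedded structure (the index and field names are assumptions taken from the examples above, not a definitive design):

// Hypothetical compound index covering the company filter fields plus the embedded score date.
db.companies.createIndex({status: 1, country: 1, city: 1, "scores.date": 1})

// Filter on company fields, match the newest score block, and project only that matching array element.
db.companies.find(
    {status: "running", country: "USA", city: "San Francisco",
     categories: {$in: ["Software"]}, dummy: false,
     scores: {$elemMatch: {date: ISODate("2016-05-29T00:00:00Z")}}},
    {company_name: 1, country: 1, city: 1, "scores.$": 1}
).sort({"scores.score": -1})

One caveat: a descending sort on an array field like scores.score uses the highest value in the whole array, so if an older score can exceed the newest one you'd want to keep the latest score in a top-level field (or sort the already-small result in the application) to order strictly by the newest score.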
Apart from this, I was wondering why you use the allowDiskUse: true flag; 100k documents doesn't sound like many, and they should fit entirely into memory, even within the limited (100 MB per stage) aggregation pipeline memory.
To explain why the filters (short = not very selective, long = very selective) behave differently depending on which collection you start from (scores vs companies), here is what you should check out:
Upvotes: 2