czuriaga

Reputation: 141

MongoDB query joining two collections

I have 2 mongo collections:

companies: Each record is a company with many fields (city, country, etc.) —> 100k rows

{company_id:1, country:"USA", city:"New York",...}
{company_id:2, country:"Spain", city:"Valencia",... }
{company_id:3, country:"France", city:"Paris",... }

scores: There are blocks of dates; within each block, every row has a company_id + a score —> 100k rows in each block

{date: 2016-05-29, company_id:1, score:90}
{date: 2016-05-29, company_id:2, score:87}
{date: 2016-05-29, company_id:3, score:75}
...
{date: 2016-05-22, company_id:1, score:88}
{date: 2016-05-22, company_id:2, score:87}
{date: 2016-05-22, company_id:3, score:76}
...
{date: 2016-05-15, company_id:1, score:91}
{date: 2016-05-15, company_id:2, score:82}
{date: 2016-05-15, company_id:3, score:73}
...

Objective:

I want to retrieve a list of companies that can be filtered by some fields (country, city, ...), plus each company's newest score (from 2016-05-29), ordered by score descending

That is: filter in one collection, filter + order in the other collection

Note: There's an index on scores.date, and we can locate/pre-calculate the highest date (2016-05-29 in this example) easily and quickly
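
For instance, a minimal sketch of that lookup, walking the scores.date index backwards:

db.scores.find({}, {date: 1, _id: 0}).sort({date: -1}).limit(1)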

Attempts:

I’ve been trying an aggregate query using $lookup. When the filter is fully specified (and the number of matching companies is small), the query is fast.

The query is as below:

db.companies.aggregate([
{$match: {"status": "running", "country": "USA", "city": "San Francisco",
         "categories": { $in: ["Software"]}, dummy: false}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$project: {_id:            "$_id",
            "company_id":   "$company_id",
            "company_name": "$company_name",
            "status":       "$status",
            "city":         "$city",
            "country":      "$country",
            "categories":   "$categories",
            "dummy":        "$dummy",
            "score":        "$scores.score",
            "date":         "$scores.date"}},
{$match: {"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort: {"score":-1}}
],{allowDiskUse: true})

But when the filter is small or empty (more companies match), the $sort stage takes several seconds.

db.companies.aggregate([
{$match: {"status": "running"}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$project: {_id:            "$_id",
            "company_id":   "$company_id",
            "company_name": "$company_name",
            "status":       "$status",
            "city":         "$city",
            "country":      "$country",
            "categories":   "$categories",
            "dummy":        "$dummy",
            "score":        "$scores.score",
            "date":         "$scores.date"}},
{$match: {"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort: {"score":-1}}
],{allowDiskUse: true})

Probably because of the number of companies the filter finds: 59 rows are easier to sort than 89k.

> db.companies.count({"status": "running", "country": "USA", "city": "San Francisco", "categories": { $in: ["Software"]}, dummy: false})
59
> db.companies.count({"status": "running"})
89043
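
For reference, a sketch of the same pipeline with the date match moved up, right after the $unwind, so score documents from older dates are discarded before the projection (it does not shrink the final sort, which still sees one document per matching company):

db.companies.aggregate([
{$match: {"status": "running"}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$match: {"scores.date": ISODate("2016-05-29T00:00:00Z")}},
{$project: {"company_id": 1, "company_name": 1, "status": 1,
            "score": "$scores.score", "date": "$scores.date"}},
{$sort: {"score": -1}}
],{allowDiskUse: true})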

I've tried a different approach: aggregating over scores, filtering by date, and sorting by score (the date+score index is very useful here). Everything is very fast until the last $match, where I filter on company attributes.

db.scores.aggregate([
{$match:{"date" : ISODate("2016-05-29T00:00:00Z")}},
{$sort:{"score":-1}},
{$lookup:{from: "companies", localField: "company_id", foreignField: "company_id", as:"companies"}},
{$unwind:"$companies"},
{$project: {_id:             "$companies._id",
            "company_id":    "$companies.company_id",
            "company_name":  "$companies.company_name",
            "status":        "$companies.status",
            "city":          "$companies.city",
            "country":       "$companies.country",
            "categories":    "$companies.categories",
            "dummy":         "$companies.dummy"}},
            "score":         "$score",
            "date":          "$date"
{$match:{"status": "running", "country":"USA", "city": "San Francisco",
         "categories": { $in: ["Software"]}, dummy: false}}
],{allowDiskUse: true})

Using this approach, the large filter (the previous example) is very slow, and the small filter (just {"status": "running"}) is faster.

Is there any way to join both collections, filter on both of them, and order by one field?

Upvotes: 2

Views: 1540

Answers (1)

Gerald Mücke

Reputation: 11132

From what I can see, there are only a couple of scores per company (not many), on various dates. So it's sort of a 1:few relationship.

So the first thing that jumps to my mind: why not put the scores in the companies collection?

{ company_id:1, 
  country:"USA", 
  city:"New York",
  ...
  scores: [
    {date: 2016-05-29, score:90},
    ...
  ]
}

That way the structure is more aligned with your access pattern, and you could skip the lookup entirely. That means you could define a proper index and use find() instead of aggregation.
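
A minimal sketch of that, assuming the embedded schema above and that the newest date is already known (the index fields are illustrative):

// Compound index covering the filter fields plus the embedded scores
db.companies.createIndex({status: 1, country: 1, city: 1, "scores.date": 1, "scores.score": -1})

// find() replaces the whole aggregation: filter companies, require a score
// for the newest date, and sort by score descending
db.companies.find({
    status: "running", country: "USA", city: "San Francisco",
    categories: "Software", dummy: false,
    scores: {$elemMatch: {date: ISODate("2016-05-29T00:00:00Z")}}
}).sort({"scores.score": -1})

One caveat: a descending sort on an array field like scores.score uses the highest value in the array, so this works best if only the scores you actually query stay embedded (e.g. old blocks get trimmed), or if the newest score is additionally stored in a top-level field.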

Apart from this, I was wondering why you use the allowDiskUse: true flag. 100k documents don't sound like that many, and they should fit entirely into memory, even within the limited (100 MB per stage) aggregation pipeline buffer.

To explain why the filters (short = not very selective, long = very selective) behave differently depending on which collection you start with (scores vs. companies):

  • company first:
    • short filter: a lot of companies match the criteria, hence a lot of companies have to be sorted (you need them all in memory to sort). If part of the result set gets written to disk, this could take quite some time.
    • long filter: only a small set of companies match, only a few companies have to be sorted in the end, probably entirely in memory
  • score first - the date might have an impact, as it defines how many companies are affected
    • long filter in the end: the result of the previous aggregation stages has to be searched in order to find matching elements, and no index can be used for that. So the match operation might take longer, as more criteria have to be evaluated - probably against data on disk.
    • short filter in the end: the result of the previous stages only has to be checked against a single criterion, which is faster.

So what you should check out:

  • disable allowDiskUse and check whether the query still fits into memory, or check the tmp files to see whether data is actually written to disk (see the sketch after this list)
  • limit the scope of the search, reducing the amount of data to be processed
  • change your schema to better match your access pattern
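
A minimal sketch of the first check, reusing the short-filter pipeline from the question. allowDiskUse defaults to false, so simply omitting it makes the aggregation fail if a stage exceeds the memory limit; the explain option shows the plan (e.g. index usage of the initial $match) without running the whole query:

// Without allowDiskUse the aggregation errors out instead of spilling to disk
db.companies.aggregate([
{$match: {"status": "running"}},
{$lookup: {from: "scores", localField: "company_id", foreignField: "company_id", as:"scores"}},
{$unwind: "$scores"},
{$match: {"scores.date": ISODate("2016-05-29T00:00:00Z")}},
{$sort: {"scores.score": -1}}
])

// explain: true returns the query plan instead of the documents
db.companies.aggregate([{$match: {"status": "running"}}], {explain: true})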

Upvotes: 2
