bresai
bresai

Reputation: 377

Match on key from two queries in a single query

I have time series data in mongodb as follows:

{ 
    "_id" : ObjectId("558912b845cea070a982d894"),
    "code" : "ZL0KOP",
    "time" : NumberLong("1420128024000"),
    "direction" : "10", 
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d895"), 
    "code" : "AQ0ZSQ", 
    "time" : NumberLong("1420128025000"), 
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d896"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128003000"),
    "direction" : "10", 
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d897"), 
    "code" : "ZL0KOP",
    "time" : NumberLong("1420041724000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d89e"),
    "code" : "YBUHCW",
    "time" : NumberLong("1420041732000"),
    "direction" : "10",
    "siteId" : "0002"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a1"),
    "code" : "U48AIW",
    "time" : NumberLong("1420041729000"),
    "direction" : "10",
    "siteId" : "0002"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a0"), 
    "code" : "OJ3A06",
    "time" : NumberLong("1420300927000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d89d"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420300885000"),
    "direction" : "10",
    "siteId" : "0003"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d8a2"),
    "code" : "ZLV05H",
    "time" : NumberLong("1420300922000"),
    "direction" : "10",
    "siteId" : "0001"
}
{
    "_id" : ObjectId("558912b845cea070a982d8a3"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420300928000"),
    "direction" : "10", 
    "siteId" : "0000"
}

The codes that match two or more conditions need to be filtered out. For example:

condition1: 1420128000000 < time < 1420128030000,siteId == 0000
condition2: 1420300880000 < time < 1420300890000,siteId == 0003

results for the first condition:

{ 
    "_id" : ObjectId("558912b845cea070a982d894"),
    "code" : "ZL0KOP",
    "time" : NumberLong("1420128024000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d895"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128025000"),
    "direction" : "10",
    "siteId" : "0000"
}
{ 
    "_id" : ObjectId("558912b845cea070a982d896"),
    "code" : "AQ0ZSQ",
    "time" : NumberLong("1420128003000"),
    "direction" : "10",
    "siteId" : "0000"
}

results for the second condition:

{ 
    "_id" : ObjectId("558912b845cea070a982d89d"),
    "code" : "AQ0ZSQ", "time" : NumberLong("1420300885000"),
    "direction" : "10",
    "siteId" : "0003"
}

The only code that matchs all the conditions above should be:

{"code" : "AQ0ZSQ", "count":2}

"count" means, the code "AQ0ZSQ" appeared in both conditions

The only solution I can think of is using two querys. For example, using python

result1 = list(db.codes.objects({'time': {'$gt': 1420128000000,'$lt': 1420128030000}, 'siteId': "0000"}).only("code"))
result2 = list(db.codes.objects({'time': {'$gt': 1420300880000,'$lt': 1420300890000}},{'siteId':'0003'}).only("code"))

and then found the shared code in both results.

The Problem is that there are millions of documents in the collection, and both query can easily exceed the 16mb limitation.

So is it possible to do that in one query? or should I change the document structure?

Upvotes: 1

Views: 1471

Answers (1)

Blakes Seven
Blakes Seven

Reputation: 50406

What you are asking for here requires the usage of the aggregation framework in order to calculate that there was an intersection between results on the server.

The first part of the logic is you need an $or query for the two conditions, then there will be some additional projection and filtering on those results:

db.collection.aggregate([
    // Fetch all possible documents for consideration
    { "$match": {
        "$or": [
            { 
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},

    // Locigically compare the conditions agaist results and add a score
    { "$project": {
        "code": "$code",
        "score": { "$add": [
            { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                1,
                0
            ]},
            { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420300880000 ] },
                    { "$lt": [ "$time", 1420300890000 ] },
                    { "$eq": [ "$siteId", "0003" ] }
                ]},
                1,
                0
            ]}
        ]}
    }},

    // Now Group the results by "code"
    { "$group": {
        "_id": "$code",
        "score": { "$sum": "$score" }
    }},

    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])

So break that down and see how it works.

First you want a query with $match to get all the possible documents for "all" of your conditions of "intersection". That is what the $or expression allows here by considering that matched documents must meet either set. You need all of them to work out the "intersection" here.

In the second $project pipeline stage a boolean test of your conditions is performed with each set. Notice the usage of $and here as well as other boolean operators of the aggregation framework is slightly different to that of the query usage form.

In the aggregation framework form ( outside of $match which uses normal query operators ) these operators take an array of arguments, to typically represent "two" values for comparison rather than the operation being assigned to the "right" of the field name.

Since these conditions are logical or "boolean" we want to return the result as "numeric" rather than a true/false value. This is what $cond does here. So where the condition is true for the document inspected a score of 1 is emitted otherwise it is 0 when false.

Finally in this $project expression both of your conditions are wrapped with $add to form the "score" result. So if none of the conditions ( not possible after the $match ) were not true the score would be 0, if "one" is true then 1, or where "both" are true then 2.

Noting here that the specific conditions asked for here will never score above 1 for a single document since no document can have the overlapping range or "two" "siteId" values as is present here.

Now the important part is to $group by the "code" value and $sum the score value to get a total per "code".

This leaves the final $match filter stage of the pipeline to only keep those documents with a "score" value that is equal to the number of conditions you asked for. In this case 2.


There is a failing there however in that where there is more than one value of "code" in the matches for either condition ( as there is ) then the "score" here would be incorrect.

So after the introduction to the principles of using logical operators in aggregation, you can fix that fault by essentially "tagging" each result logically as to which condition "set" it applies to. Then you can basically consider which "code" appeared in "both" sets in this case:

db.collection.aggregate([
    { "$match": {
        "$or": [
            { 
                "time": { "$gt": 1420128000000, "$lt": 1420128030000 },
                "siteId": "0000"
            },
            {
                "time": { "$gt": 1420300880000, "$lt": 1420300890000 },
                "siteId": "0003"
            }
        ]
    }},

    // If it's the first logical condition it's "A" otherwise it can
    // only be the other, therefore "B". Extend for more sets as needed.
    { "$group": {
        "_id": {
            "code": "$code",
            "type": { "$cond": [ 
                { "$and":[
                    { "$gt": [ "$time", 1420128000000 ] },
                    { "$lt": [ "$time", 1420128030000 ] },
                    { "$eq": [ "$siteId", "0000" ] }
                ]},
                "A",
                "B"
            ]}
        }
    }},

    // Simply add up the results for each "type"
    { "$group": {
        "_id": "$_id.code",
        "score": { "$sum": 1 }
    }}

    // Now filter to keep only results with score 2
    { "$match": { "score": 2 } }
])

It might be a bit to take in if this is your first time using the aggregation framework. Please take the time to look at the operators used as defined with the links here and also look at Aggregation Pipeline Operators in general.

Beyond simple data selection, this is the tool you should be reaching to most often when using MongoDB, so you would do well to understand all the operations that are possible.

Upvotes: 1

Related Questions