Reputation: 19270
I read this and tried experimenting with it.
Below is my schema.
SCHEMA:
{
"s" : "CB",
"c" : "REQ_RCV",
"e" : "sms_click",
"st" : "i",
"b" : "2",
"a" : "1",
"u" : "b1_h1_d1_m1_user_2",
"c#" : "b1_h1_d1_m1_cr-2",
"@" : ISODate("2016-10-01T06:03:00.000Z"), //Indexed
"@h" : "16100106", //hourly bucket
"@d" : "161001", //Indexed
"@m" : "1610"
}
And below is the explain plan:
> 2017-01-22T13:43:47.764+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
>   aggregate: "template3",
>   pipeline: [
>     { $match: { @: { $gte: new Date(1483228800000), $lte: new Date(1483315199000) } } },
>     { $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } },
>     { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME: "$_id.HOURLY", count: 1.0 } },
>     { $sort: { SERVICE: 1.0, BUCKET: 1.0, TIME: 1.0 } }
>   ],
>   cursor: {} }
> planSummary: IXSCAN { @: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1 cursorExhausted:1 numYields:925 nreturned:96 reslen:7095
> locks:{ Global: { acquireCount: { r: 1860 } }, Database: { acquireCount: { r: 930 } }, Collection: { acquireCount: { r: 929 } } }
> protocol:op_command **3499ms**
> 2017-01-22T13:44:24.825+0530 I COMMAND [conn34] command test-analytics.template3 appName: "MongoDB Shell" command: aggregate {
>   aggregate: "template3",
>   pipeline: [
>     { $match: { @d: "170101" } },
>     { $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1.0 } } },
>     { $project: { _id: 0.0, BUCKET: "$_id.b", SERVICE: "$_id.s", TIME: "$_id.HOURLY", count: 1.0 } },
>     { $sort: { SERVICE: 1.0, BUCKET: 1.0, TIME: 1.0 } }
>   ],
>   cursor: {} }
> planSummary: IXSCAN { @d: 1.0 } keysExamined:106888 docsExamined:106888 hasSortStage:1 cursorExhausted:1 numYields:865 nreturned:96 reslen:7095
> locks:{ Global: { acquireCount: { r: 1740 } }, Database: { acquireCount: { r: 870 } }, Collection: { acquireCount: { r: 869 } } }
> protocol:op_command **1294ms**
Questions:

1. Though both queries examined the same number of documents, why is there a time difference in the output?
2. Are the $lte and $gte operators slow only on date ranges, or are they slow on numeric comparisons as well?
3. Since bucketing gives a faster response, how can I use bucketing for range queries? I can make multiple aggregate time-bucket calls to support range queries, but that costs more round-trip time. Any suggestions?
4. Is it possible to $and two $match stages in an aggregate query to support range bucketing? Aggregate currently accepts multiple $match stages, but the output of the first $match is fed to the second $match, whereas what I want is the union of the individual $match results passed to the next pipeline stage.

Possible answer for Q4:
db.template3.aggregate([
{
$match: {
$or: [
{"@d":"170301"},
{"@d":"170306"},
{"@d":"170202"},
{"@d":"170303"},
{"@d":"170304"},
{"@d":"170305"}
]
}
},
{ $project: { _id: 0, "b": 1, "s": 1, "@h": 1 } },
{
$group: {
_id: {"b": "$b", "HOURLY": "$@h", "s": "$s" },
count: { $sum: 1 }
}
},
{
$project: {
_id: 0,
"BUCKET": "$_id.b",
"SERVICE": "$_id.s",
"TIME": "$_id.HOURLY",
count: 1
}
},
{ $sort: { "SERVICE": 1, "BUCKET": 1, "TIME": 1 } }
]);
With this approach we can mix bucket granularities (daily and monthly), and each $or clause can still use its own index. See https://docs.mongodb.com/manual/reference/operator/query/or/#or-clauses-and-indexes.
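For example, a sketch mixing daily and monthly buckets in one $or (assuming indexes exist on both "@d" and "@m"):

db.template3.aggregate([
  {
    $match: {
      $or: [
        { "@m": "1702" },    // the whole of Feb 2017 via the monthly bucket
        { "@d": "170301" },  // individual days via the daily bucket
        { "@d": "170302" }
      ]
    }
  }
  // ...rest of the pipeline ($project, $group, $sort) as above
]);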
Sample query:
db.template3.aggregate([
{$match:{"@h":{$gte : 17020511, $lte : 17030511}, "st":"i"}},
{$project : {"_id":0, "@h":1,"c":1, "@m":1}},
{$group:{_id:{ "HOURLY":"$@h", "c":"$c"}, count:{$sum:1}}},
{$project : {_id:0, "COUNTER":"$_id.c","TIME":"$_id.HOURLY", count:1}},
{$sort:{"COUNTER":1,"TIME":1}}
]);
Output:
{ "count" : 2255, "COUNTER" : "REQ_RCVD", "TIME" : 17020511 }
{ "count" : 28888, "COUNTER" : "REQ_RCVD", "TIME" : 17020600 }
{ "count" : 37613, "COUNTER" : "REQ_RCVD", "TIME" : 17020601 }
{ "count" : 6723, "COUNTER" : "REQ_RCVD", "TIME" : 17020602 }
{ "count" : 14057, "COUNTER" : "REQ_RCVD", "TIME" : 17020603 }
{ "count" : 12405, "COUNTER" : "REQ_RCVD", "TIME" : 17020604 }
{ "count" : 2392, "COUNTER" : "REQ_RCVD", "TIME" : 17020611 }
{ "count" : 28784, "COUNTER" : "REQ_RCVD", "TIME" : 17020700 }
{ "count" : 37494, "COUNTER" : "REQ_RCVD", "TIME" : 17020701 }
{ "count" : 6697, "COUNTER" : "REQ_RCVD", "TIME" : 17020702 }
{ "count" : 13930, "COUNTER" : "REQ_RCVD", "TIME" : 17020703 }
{ "count" : 12493, "COUNTER" : "REQ_RCVD", "TIME" : 17020704 }
{ "count" : 2225, "COUNTER" : "REQ_RCVD", "TIME" : 17020711 }
{ "count" : 28821, "COUNTER" : "REQ_RCVD", "TIME" : 17020800 }
{ "count" : 37949, "COUNTER" : "REQ_RCVD", "TIME" : 17020801 }
{ "count" : 6676, "COUNTER" : "REQ_RCVD", "TIME" : 17020802 }
{ "count" : 14039, "COUNTER" : "REQ_RCVD", "TIME" : 17020803 }
{ "count" : 12349, "COUNTER" : "REQ_RCVD", "TIME" : 17020804 }
{ "count" : 2332, "COUNTER" : "REQ_RCVD", "TIME" : 17020811 }
{ "count" : 28379, "COUNTER" : "REQ_RCVD", "TIME" : 17020900 }
OPTIMIZATION

- I suspect most of the time is spent reading non-indexed fields, hence docsExamined: 106888.
- There is a noticeable improvement when I use $project before $group.
- I have changed the "@h" data type from String to Integer (NumberInt); I think it will improve things a bit more.
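For reference, the conversion can be done with a one-off shell script along these lines (a sketch; it assumes every existing "@h" value is a numeric string):

// Convert "@h" from a numeric string to a 32-bit integer, document by document.
db.template3.find({ "@h": { $type: "string" } }).forEach(function (doc) {
  db.template3.update(
    { _id: doc._id },
    { $set: { "@h": NumberInt(doc["@h"]) } }
  );
});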
Upvotes: 3
Views: 273
Reputation: 4413
Let's go through your questions one by one:
> Though both queries examined the same number of documents, why is there a time difference in the output?
Looking at performance metrics from a single execution is not how benchmarking works; you should average over several executions before drawing conclusions, because several factors are at play. That said, MongoDB caches the most frequently used documents in memory and keeps them there unless it has to yield memory to other documents. So if a query accesses documents already cached by a previous query, it should be faster.
Also, in MongoDB an aggregation pipeline can use indexes only at its beginning, if at all. For example, the $match and $sort stages can use indexes. In your case $match is the first pipeline stage, so that's a win.
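You can verify this by asking for the plan (a sketch; the exact explain output varies by server version):

// Look for IXSCAN in the returned plan, as in the log lines above.
db.template3.aggregate(
  [
    { $match: { "@d": "170101" } },
    { $group: { _id: { b: "$b", HOURLY: "$@h", s: "$s" }, count: { $sum: 1 } } }
  ],
  { explain: true }
);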
> Are the $lte and $gte operators slow only on date ranges, or are they slow on numeric comparisons as well?
In MongoDB, data is stored as BSON, so dates are basically numbers when they are compared. There is no difference.
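To illustrate, a BSON date is just a 64-bit millisecond count since the Unix epoch, and that number is what actually gets compared:

var d = ISODate("2017-01-01T00:00:00Z");
d.getTime();              // 1483228800000 -- the value used in comparisons
new Date(1483228800000);  // same instant: ISODate("2017-01-01T00:00:00Z")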
> Since bucketing gives a faster response, how can I use bucketing for range queries? I can make multiple aggregate time-bucket calls to support range queries, but that costs more round-trip time. Any suggestions?
Although I've not tested it, I really doubt the time_bucket approach will give a faster response. Since created_at always increases, new index entries will be appended at the end of the index with or without the time_bucket. Furthermore, an index created on an array is comparatively much larger than one on a simple date field; won't that cause problems fitting the index in RAM?
Using a time_bucket makes sense only when you apply some function to the date field before matching; if you extract just the year from the date field before matching, the existing index on the date becomes useless.
It's always better to cast your parameters to match the datatype in the database than the other way around.
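For example, a minimal sketch of casting the parameters (here assumed to arrive as strings) so the existing index on "@" stays usable:

// Good: convert the parameters, not the indexed field.
var from = new Date("2017-01-01T00:00:00Z");
var to = new Date("2017-01-01T23:59:59Z");
db.template3.aggregate([
  { $match: { "@": { $gte: from, $lte: to } } },  // the index on "@" can be used
  { $group: { _id: "$b", count: { $sum: 1 } } }
]);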
> Is it possible to $and two $match stages in an aggregate query to support range bucketing? Aggregate currently accepts multiple $match stages, but the output of the first $match is fed to the second $match, whereas what I want is the union of the individual $match results passed to the next pipeline stage.
Yes, it's possible. If it's $and, you just specify all your filters separated by commas in the $match stage. If it's $or, use the $or operator.
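In $match terms, an illustrative sketch of both cases:

// AND: filters listed side by side in a single $match stage
db.template3.aggregate([ { $match: { "@d": "170101", "st": "i" } } ]);

// OR: the union of several filters, which is what range bucketing needs
db.template3.aggregate([ { $match: { $or: [ { "@d": "170101" }, { "@d": "170102" } ] } } ]);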
If you have two consecutive $match stages, MongoDB coalesces them into one. So you don't need to worry about combining the results of multiple match stages yourself.
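That is, these two pipelines should behave identically after the optimizer coalesces the stages:

// Two consecutive $match stages...
db.template3.aggregate([ { $match: { "@d": "170101" } }, { $match: { "st": "i" } } ]);
// ...are merged into the equivalent single stage:
db.template3.aggregate([ { $match: { "@d": "170101", "st": "i" } } ]);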
Now your Optimization Points
> I suspect most of the time is spent reading non-indexed fields, hence docsExamined: 106888
Yes, covered queries are much faster.
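As a sketch, with a hypothetical compound index that covers both the filter and the returned fields, a plain find can be answered from the index alone (explain would report totalDocsExamined: 0):

// Hypothetical index: the match field first, then the fields being read.
db.template3.createIndex({ "@d": 1, "b": 1, "s": 1, "@h": 1 });

// Covered query: the filter and projection touch only indexed fields, and _id is excluded.
db.template3.find(
  { "@d": "170101" },
  { _id: 0, "b": 1, "s": 1, "@h": 1 }
);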
> There is a noticeable improvement when I use $project before $group

If the $project reduces the size of the documents flowing into the $group stage, then yes, it's true.
> I have changed the "@h" data type from String to Integer (NumberInt); I think it will improve things a bit more.
It's not necessarily true, but it's generally the case. You may check this answer.
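One quick way to see the size difference in the shell (illustrative; the byte counts include per-document BSON overhead):

Object.bsonsize({ h: NumberInt(17020511) });  // a 4-byte int32 value
Object.bsonsize({ h: "17020511" });           // one byte per character plus length prefix and terminator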
Upvotes: 1