nonagon

Reputation: 3483

running mongo queries against data in memory

I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they're running nearly continuously.

The query associated with each count operation is rather involved, and I don't think there's a way to get them all to run with indices (i.e. as COUNT_SCAN operations).

The only feasible solution I've come up with is to:

  • run a single query each hour that fetches the entire collection into memory, then
  • evaluate each of the count queries locally against those in-memory documents.

Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution it is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.

So my question is whether there's any support in the mongo drivers (pymongo in my case, but I'm curious in general) for interpreting query documents and running them locally against data in memory, rather than against data on the mongodb server.

Initially this felt like an odd request, but there are actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.
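
For context, the hourly job currently looks roughly like this (the collection, field names and filters below are made up for illustration); each count forces its own full collection scan:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client.mydb.events  # hypothetical database/collection

# Each call below is a separate server-side scan of the whole collection.
hourly_counts = {
    "active": coll.count_documents({"status": "active"}),
    "errors_with_retries": coll.count_documents(
        {"$and": [{"level": "error"}, {"retries": {"$gt": 3}}]}),
    # ...dozens more, each with a different, fairly involved filter
}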

Upvotes: 3

Views: 4229

Answers (4)

Juan Bermudez

Reputation: 430

Maybe you could try another approach? I mean, MongoDB performs really badly at counting, especially with big collections.

I had a pretty similar problem at my last company, and what we did was create a "counters" document and update it on every write you perform over your data. In this way, you avoid counting at all.

The document would be something like:

{
query1count: 12,
query2count: 512312,
query3count: 6
}

If query1count is related to the query "all documents where userId = 13", then in your Python layer you can check, before creating/updating a document, whether userId = 13 and, if so, increment the desired counter.

It does add a lot of extra complexity to your code, but reads of the counters are then O(1).

Of course, not all the queries may be that easy, but you can reduce the execution time a lot with this approach.
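
A minimal sketch of the idea in pymongo (the database, collection and counter names here are made up; you would adapt the matching check to each of your real queries):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.mydb  # hypothetical database

def insert_document(doc):
    db.events.insert_one(doc)
    # Keep the pre-computed counter in sync with every write.
    if doc.get("userId") == 13:
        db.counters.update_one({"_id": "query_counts"},
                               {"$inc": {"query1count": 1}},
                               upsert=True)

def read_counters():
    # O(1) read instead of a full collection scan per count.
    return db.counters.find_one({"_id": "query_counts"})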

Upvotes: 0

AJS

Reputation: 2023

As you are using Python, have you considered pandas? You could transform your JSON data into a pandas DataFrame and query it as you like; you get a whole bunch of operations like count, group by, aggregate, etc. Please take a look at the docs. I'm adding a small example below to help you relate. Hope this helps.

For example:

import pandas as pd

data = {
    "data_points": [
        {"a": 1, "b": 3, "c": 2},
        {"a": 3, "b": 2, "c": 1},
        {"a": 5, "b": 4, "d": 3},
    ]
}

# convert the list of JSON documents to a data frame
# (on older pandas versions this was pandas.io.json.json_normalize)
df = pd.json_normalize(data["data_points"])

(The screenshot showed the resulting DataFrame: columns a, b, c and d, with NaN where a document is missing a key.)

Now you can just perform operations on the DataFrame like sum, count, etc.

Example:

# sum of column `a`
df['a'].sum()

output: 9

# sum of column `c` that has null values.
df['c'].sum()

output: 3.0

# count of column `c` that has null values.
df['c'].count()

output: 2
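
To relate this back to the question's use case, each count query becomes a boolean filter on the DataFrame; for example (the condition here is made up):

# count of rows where `a` > 1 and `b` < 4
len(df[(df['a'] > 1) & (df['b'] < 4)])

output: 1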

Upvotes: 1

nonagon

Reputation: 3483

Here's the code I have currently to solve this problem. I have enough tests running against it to qualify it for my use case, but it's probably not 100% correct. I certainly don't handle all possible query documents.

import operator
from collections.abc import Mapping

# `Mongo` (a holder for the query-operator strings such as "$and", "$or",
# "$in", "$gte", "$comment", ...) and `nested_dict_get` (dotted-path lookup
# into a nested document) are small helpers defined elsewhere in my codebase.


def check_doc_against_mongo_query(doc, query):
    """Return whether the given doc would be returned by the given query.

    Initially this might seem like work the db should be doing, but consider a use case where we
    need to run many complex queries regularly to count matches. If each query results in a full-
    collection scan, it is often faster to run a single scan fetching the entire collection into
    memory, then run all of the matches locally.

    We don't support mongo's full query syntax here, so we'll need to add support as the need
    arises."""

    # Run our check recursively
    return _match_query(doc, query)


def _match_query(doc, query):
    """Return whether the given doc matches the given query."""

    # We don't expect a null query
    assert query is not None

    # Check each top-level field for a match, we AND them together, so return on mismatch
    for k, v in query.items():
        # Check for AND/OR operators
        if k == Mongo.AND:
            if not all(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.OR:
            if not any(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.COMMENT:
            # Ignore comments
            pass
        else:
            # Now grab the doc's value and match it against the given query value
            doc_v = nested_dict_get(doc, k)
            if not _match_doc_and_query_value(doc_v, v):
                return False

    # All top-level fields matched so return match
    return True


def _match_doc_and_query_value(doc_v, query_v):
    """Return whether the given doc and query values match."""

    cmps = []  # we AND these together below, trailing bool for negation

    # Check for operators
    if isinstance(query_v, Mapping):
        # To handle 'in' we use a tuple, otherwise we use an operator and a value
        for k, v in query_v.items():
            if k == Mongo.IN:
                cmps.append((operator.eq, tuple(v), False))
            elif k == Mongo.NIN:
                cmps.append((operator.eq, tuple(v), True))
            else:
                op = {Mongo.EQ: operator.eq, Mongo.GT: operator.gt, Mongo.GTE: operator.ge,
                      Mongo.LT: operator.lt, Mongo.LTE: operator.le, Mongo.NE: operator.ne}[
                          k]
                cmps.append((op, v, False))
    else:
        # We expect a simple value here, perform an equality check
        cmps.append((operator.eq, query_v, False))

    # Now perform each comparison
    return all(_invert(_match_cmp(op, doc_v, v), invert) for op, v, invert in cmps)


def _invert(result, invert):
    """Invert the given result if necessary."""

    return not result if invert else result


def _match_cmp(op, doc_v, v):
    """Return whether the given values match with the given comparison operator.

    If v is a tuple then we require op to match with any element.

    We take care to handle comparisons with null the same way mongo does, i.e. only null ==/<=/>=
    null returns true, all other comps with null return false. See:
    https://stackoverflow.com/questions/29835829/mongodb-comparison-operators-with-null
    for details.

    As an important special case of null comparisons, ne null matches any non-null value.
    """

    if doc_v is None and v is None:
        return op in (operator.eq, operator.ge, operator.le)
    elif op is operator.ne and v is None:
        return doc_v is not None
    elif v is None:
        return False
    elif isinstance(v, tuple):
        return any(op(doc_v, x) for x in v)
    else:
        return op(doc_v, v)
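
Roughly how this gets used (the collection and query names here are made up): fetch the collection once per hour, then evaluate every count locally:

docs = list(db.events.find())  # the single full-collection scan per hour

count_queries = {
    "active_users": {"status": "active"},
    "errored_with_retries": {"$and": [{"level": "error"},
                                      {"retries": {"$gte": 3}}]},
}

counts = {name: sum(check_doc_against_mongo_query(doc, q) for doc in docs)
          for name, q in count_queries.items()}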

Upvotes: 0

wowkin2

Reputation: 6355

MongoDB In-Memory storage engine

If you want to process data with complex queries in RAM only, using MongoDB query syntax, you can configure MongoDB to use its in-memory storage engine, which avoids disk I/O entirely.
For me, this is the best option: you keep the ability to run complex queries and get the best performance.
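
A minimal mongod.conf sketch (note that the in-memory engine ships with MongoDB Enterprise and with Percona Server for MongoDB, not with the community build; the path and size limit below are arbitrary examples):

storage:
  engine: inMemory
  dbPath: /var/lib/mongodb
  inMemory:
    engineConfig:
      inMemorySizeGB: 2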

Python in-memory databases:

You can use one of the following:

  • PyDbLite - a fast, pure-Python, untyped, in-memory database engine, using Python syntax to manage data, instead of SQL
  • TinyDB - if you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you. But it is not a fast solution and has a few other disadvantages (a small sketch follows below).

They should allow working with data directly in RAM, but I'm not sure if this is better than the previous option.
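
For example, a minimal TinyDB sketch with an in-memory backend (the field names and values are made up; note that TinyDB queries are Python expressions rather than MongoDB query documents):

from tinydb import TinyDB, Query
from tinydb.storages import MemoryStorage

db = TinyDB(storage=MemoryStorage)  # keeps everything in RAM
db.insert_multiple([
    {"userId": 13, "status": "active"},
    {"userId": 14, "status": "inactive"},
])

Doc = Query()
# count the documents matching a condition
print(db.count((Doc.userId == 13) & (Doc.status == "active")))  # -> 1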

Own custom solution (e.g. written in Python)

Some services handle data in RAM at the application level only. If your solution is not complicated and the queries are simple, this is fine. But over time the queries become more complicated and the code starts to require an abstraction layer (for advanced CRUD), much like the databases above.

The last solution can have the best performance, but it takes more time to develop and support it.

Upvotes: 1
