Matt

Reputation: 578

mongodb - need to process massive data, with only one server instance

I'm trying to process about a hundred million records in mongodb. Basically, each key (prescription number) corresponds to about 1300 records (not unique). These keys have been indexed.

Right now, I am querying each key with pymongo to return its set of results so they can be processed with python.

Querying mongo is the biggest bottleneck. It is taking about 20 seconds per query, and at that rate it will take about 400 hours to query every record.
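The per-key loop is essentially this (a simplified sketch; keys_to_process and process() stand in for my real key list and processing code):

from pymongo import MongoClient

client = MongoClient()  # local mongod on the default port
coll = client["processed_data"]["prescriptions"]

def process(records):
    """Placeholder for the actual python-side processing."""
    pass

# Stand-in for the full list of roughly 77,000 prescription numbers.
keys_to_process = [68565299]

for key in keys_to_process:
    # Each key matches roughly 1300 (non-unique) documents.
    records = list(coll.find({"key": key}))
    process(records)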

This is what it looks like when I 'explain' my query:

db.prescriptions.find({'key':68565299}).explain()

{
    "cursor" : "BasicCursor",
    "nscanned" : 103578563,
    "nscannedObjects" : 103578563,
    "n" : 1603,
    "millis" : 287665,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {

    }
}

And this shows that I have the indexes in place:

> db.prescriptions.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "processed_data.prescriptions",
        "name" : "_id_"
    }
]

Am I off my rocker for trying to run this data processing on one server instance? (Interestingly, my CPU and RAM do not appear to be maxed out when I run top.)

I would be grateful for any advice.

Thanks!!

Upvotes: 2

Views: 216

Answers (1)

AD7six

Reputation: 66268

Add an index

From the explain output for your query there is no index on "key"; you need to add one:

> db.prescriptions.ensureIndex({'key': 1});

(On MongoDB 2.6 and later the same thing can be done with createIndex.) If mongo reports any kind of warning, you'll need to act on it.
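Since you're already using pymongo for the query side, you can also create the index from there. A sketch (the background=True option matters on older MongoDB versions, where a foreground build blocks the database; newer servers ignore it):

from pymongo import MongoClient, ASCENDING

client = MongoClient()  # local mongod on the default port
coll = client["processed_data"]["prescriptions"]

# Build the index on "key". On a ~100M-document collection this will take a while;
# background=True avoids a blocking foreground build on older MongoDB versions.
coll.create_index([("key", ASCENDING)], background=True)

# Confirm the new index exists alongside the default _id_ index.
print(coll.index_information())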

Upvotes: 3
