Lit2000hania

Reputation: 97

Improve query performance in large collection with large documents using indexes or in any other possible way

I am using PyMongo with Flask and I would like to know how to optimize a query, as I am filtering within a large collection (8793 documents) with large documents.

This is one of the document structures of the collections:

[Screenshot: structure of a sample document]

As you can see, it has 4 properties (simulationID, simulationPartID, timePass and status, which stores many arrays). This collection has a size of 824.4MB. The average size of the documents is 96.0KB.

[Screenshot: collection statistics]

Basically, I'm trying to find the documents that have simulationPartID 7 (1256 documents), and in each of them pick, inside the status property, the sub-array that corresponds to the nodeID value I receive as a parameter, take that sub-array's fourth or fifth element (depending on the case parameter), and append it together with the timePass.

import json
from bson import json_util

def node_history(nodeID, case):
    coordinates = []
    node_data = db['node_data']
    # compound index on the two fields I filter on
    node_data.create_index([('simulationPartID', 1), ('simulationID', 1)])
    if case == 'Temperature':
        for document in node_data.find({"simulationPartID": 7}):
            # 4th element of the sub-array for this node
            coordinates.append([document['timePass'], document['status'][int(nodeID)-1][3]])
    elif case == 'Stress':
        for document in node_data.find({"simulationPartID": 7}):
            # 5th element of the sub-array for this node
            coordinates.append([document['timePass'], document['status'][int(nodeID)-1][4]])
    else:
        pass
    coordinates.sort()
    return json.dumps(coordinates, default=json_util.default)

As I mentioned, the collection is very large, and the query takes about 30 to 60 seconds depending on the machine, but I want it to run as quickly as possible because I want my application to be as interactive as possible. As you can see, I already tried to create an index on both the simulationID and simulationPartID properties.

I have never worked with large collections before, so I'm not familiar with indexing; I don't even know if I did it properly in my code. I would like to know if there is a way to optimize my query, whether with a different indexing approach or in any other possible way, to make it faster.

Data samples:

{
  "_id": {
    "$oid": "5f83f54d45104462898aba67"
  },
  "simulationID": "001",
  "simulationPartID": 7,
  "timePass": 0,
  "status": [
    [
      1,
      1.34022987724954e-40,
      0.00220799725502729,
      20,
      114.911392211914
    ],
    [
      2,
      0.00217749993316829,
      0.00220799725502729,
      20,
      -2.0458550453186
    ],
    [
      3,
      0.0020274999551475,
      0.00235799723304808,
      20,
      -1.33439755439758
    ],
    [
      4,
      3.36311631437956e-44,
      0.00235799723304808,
      20,
      148.233413696289
    ],
    [
      5,
      1.02169119449431e-38,
      0.000149997213156894,
      20,
      -25633.59765625
    ]
  ]
},

{
  "_id": {
    "$oid": "5f83f54d45104462898aba68"
  },
  "simulationID": "001",
  "simulationPartID": 7,
  "timePass": 1,
  "status": [
    [
      1,
      1.34022987724954e-40,
      0.00220799725502729,
      20,
      114.911392211914
    ],
    [
      2,
      0.00217749993316829,
      0.00220799725502729,
      20,
      -2.0458550453186
    ],
    [
      3,
      0.0020274999551475,
      0.00235799723304808,
      20,
      -1.33439755439758
    ],
    [
      4,
      3.36311631437956e-44,
      0.00235799723304808,
      20,
      148.233413696289
    ],
    [
      5,
      1.02169119449431e-38,
      0.000149997213156894,
      20,
      -25633.59765625
    ]
  ]
},
{
  "_id": {
    "$oid": "5f83f54d45104462898aba69"
  },
  "simulationID": "001",
  "simulationPartID": 7,
  "timePass": 2,
  "status": [
    [
      1,
      1.34022987724954e-40,
      0.00220799725502729,
      20,
      114.911392211914
    ],
    [
      2,
      0.00217749993316829,
      0.00220799725502729,
      20,
      -2.0458550453186
    ],
    [
      3,
      0.0020274999551475,
      0.00235799723304808,
      20,
      -1.33439755439758
    ],
    [
      4,
      3.36311631437956e-44,
      0.00235799723304808,
      20,
      148.233413696289
    ],
    [
      5,
      1.02169119449431e-38,
      0.000149997213156894,
      20,
      -25633.59765625
    ]
  ]
}

Thank you!

Upvotes: 0

Views: 275

Answers (1)

Wernfried Domscheit

Reputation: 59513

Do you create the index for each query? An index needs to be created only once, typically when you deploy the application.
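
For example, with PyMongo you would run it once at startup, not inside the request handler (a minimal sketch; the connection string and database name are assumptions):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # assumed URI
db = client["mydatabase"]                                  # assumed name

# create_index is idempotent: if the index already exists, this is a no-op
db.node_data.create_index([("simulationPartID", pymongo.ASCENDING),
                           ("simulationID", pymongo.ASCENDING)])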

Your find returns the full document, which is not needed. You can limit the result with $slice:

db.node_data.find({"simulationPartID": 7}, {"timePass": 1, "status": { '$slice': [ 3, 1 ] } } )

This should return the data much faster, because it returns only the values you want to get.
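
From PyMongo, the same projection could look like this (a sketch assuming the node_data collection from your question; nodeID is 1-based while the array index is 0-based):

cursor = db.node_data.find(
    {"simulationPartID": 7},
    # fetch timePass plus exactly one sub-array instead of all of status
    {"timePass": 1, "status": {"$slice": [int(nodeID) - 1, 1]}},
)
for document in cursor:
    sub_array = document["status"][0]  # the single sliced sub-array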

If you want to select a sub-element from the array, you can use this:

db.collection.aggregate([
  { $match: { timePass: 2 } },
  { $set: { status: { $arrayElemAt: [ "$status", 4 ] } } },
  { $set: { status: { $arrayElemAt: [ "$status", 3 ] } } },
])

Mongo playground
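
Adapted to your node_history function, the same pipeline might look like this from PyMongo (a sketch, not the playground code: the match on simulationPartID and the element index 3 for the Temperature case / 4 for Stress are taken from your question):

element_index = 3 if case == 'Temperature' else 4
pipeline = [
    {"$match": {"simulationPartID": 7}},
    # keep only the sub-array for this node ...
    {"$set": {"status": {"$arrayElemAt": ["$status", int(nodeID) - 1]}}},
    # ... then keep only the requested element of that sub-array
    {"$set": {"status": {"$arrayElemAt": ["$status", element_index]}}},
    {"$project": {"_id": 0, "timePass": 1, "status": 1}},
]
coordinates = sorted([doc["timePass"], doc["status"]]
                     for doc in db.node_data.aggregate(pipeline))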

Upvotes: 1
