Choosing MongoDB indexes to maximize performance and minimize the number of indexes

Question

Indexes in MongoDB should be kept to a minimum, or so I have tried to do after reading this. I also went through this nice presentation, which has some great tips.

My question is easiest to explain with an example:

I need to store data about goods stored in locations which are indexed by catalog.

I have several locations. Each location can hold many goods. Each good is either in stock or not. Every (location, good, stock status) triplet belongs to a given catalog.

CATALOG_X: [ { location: "...", good: "...", stock_or_not: 0|1 },
             { location: "...", good: "...", stock_or_not: 0|1 },
             ...
           ]

e.g.:

CATALOG1: NEWYORK - BREAD - 0 // 0 is out of stock
CATALOG2: LOSANGELES - WATER - 1 // 1 is in stock

One complication: I can only obtain the status of these goods with a query by catalog, like:

getMeStatusOfCatalog("CATALOG1") -> { "NEWYORK - BREAD" : 0 } // location - good : instock

However, an important point: getMeStatusOfCatalog may simply stop returning an answer for a particular good. If I no longer see BREAD, I still need to set BREAD to 0 for that particular catalog. This means that my "write" operation needs to fetch all the previously "active" goods for a catalog (in_catalog below).
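
The reconciliation step described above can be sketched in plain JavaScript as a diff between the goods previously marked active for a catalog and the latest response. This is only a sketch under assumptions: `findMissingKeys` is a hypothetical helper, the response is assumed to be an object keyed by "LOCATION - GOOD" as in the example, and the previously active keys are assumed to have been loaded from the database already.

```javascript
// Hypothetical helper: returns the keys that were active before but are
// absent from the latest getMeStatusOfCatalog response.
function findMissingKeys(previouslyActiveKeys, latestStatus) {
  const seen = new Set(Object.keys(latestStatus));
  return previouslyActiveKeys.filter((key) => !seen.has(key));
}

// Example data, made up for illustration.
const previouslyActive = ["NEWYORK - BREAD", "NEWYORK - MILK"];
const latest = { "NEWYORK - MILK": 1 };

// Keys whose documents should be reset to in_stock: 0 and in_catalog: 0.
const missing = findMissingKeys(previouslyActive, latest);
console.log(missing);
```

Here BREAD is the only key reported as missing, so it is the only document that needs to be reset for this catalog.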

The main business objective of my application is to display a breakdown per location, like:

Location 1 (New York) has: bread which is out of stock, etc
Location 2 (Los Angeles) has: water which is fine, etc

If I set up my _id as "LOCATION:GOOD", then I can query with /^LOCATION/ and get everything I need. Overall:

{
  _id: "LOCATION:GOOD",
  catalog: "...",
  in_stock: 0|1,
  in_catalog: 0|1
}

However, I am troubled by data insertion/update based on the getMeStatusOfCatalog. How will I fetch all the documents which need to be updated?

I can't simply add the catalog to the _id discussed above, since I would no longer be able to use a left-anchored regex. Do I need to index catalog as well (compound, unique)? Like:

{
  _id: "LOCATION:GOOD:CATALOG", // unique by itself, repeating catalog to ensure uniqueness, used for fast reads
  _catalog: "...", // unique by itself, used for fast writes
  in_stock: 0|1,
  in_catalog: 0|1
}

createIndex( { "_id": 1, "_catalog": 1 }, { unique: true } )

In this way I sacrifice space (double index) to have faster queries.

How would you design this so as to have both performant reads and writes?

Joe Enzminger · Accepted Answer

There is always a trade-off between insertion and query performance. As you add indexes to optimize your queries, you will impact insertion rates.

However, unless your use case is a web-scale transaction or event processing system (thousands of transactions per second), you are unlikely to notice insertion performance issues due to one or two compound indexes. You may be making your solution more complex in order to solve a problem you don't have.

In fact, your proposed solution is inadvertently making an insertion problem more likely!

It is true that if possible, you should use the _id field. However, as implemented, the default _id field has two important properties:

1. It is unique
2. It increases with time

When MongoDB performs an insert or an update, it places the document on disk based on the value of _id (this behavior varies by storage engine, with MMAPv1 being especially susceptible). In your implementation, you lose the second property, so it is entirely possible that, at scale, a single insert or update may force MongoDB to reorganize your data on disk, which will be extremely slow.
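
The "increases with time" property comes from the layout of the default ObjectId, whose first 4 bytes are a timestamp in seconds since the Unix epoch. A minimal sketch in plain JavaScript, where `objectIdSeconds` is a hypothetical helper and the hex strings are made-up example values:

```javascript
// Hypothetical helper: reads the leading 4-byte timestamp (seconds since
// the Unix epoch) from a 24-character ObjectId hex string.
function objectIdSeconds(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  return parseInt(hex.slice(0, 8), 16);
}

// Made-up example ids; only the first 8 hex digits matter here.
const earlier = "507f191e810c19729de860ea";
const later = "507f1f77bcf86cd799439011";

// The id generated later carries the larger timestamp prefix.
console.log(objectIdSeconds(earlier) < objectIdSeconds(later)); // true
```

A hand-built string such as "LOCATION:GOOD" has no such time-ordered prefix, which is the property the paragraph above says is being lost.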

If you carefully read the presentation you reference, you'll see that their choice of a key for _id maintains both these properties!

For your application, I would leave _id alone and instead include catalog, location, and good as document properties in a compound index. Yes, it will marginally increase storage requirements and imperceptibly impact insertion performance, but your data set will be very large before you notice a performance issue.
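
One possible reading of this suggestion, sketched in plain JavaScript: the key specifications and filters below are ordinary objects that could be passed to `createIndex`, `find`, and `updateMany` in the mongo shell or a driver. The field order in the indexes, the helper names, and the `$nor`-based reset are assumptions for illustration, not something prescribed above.

```javascript
// Assumed index key specs. A compound index also supports queries on a
// prefix of its fields, so the first serves per-location reads and the
// second serves per-catalog writes.
const readIndex = { location: 1, good: 1, catalog: 1 }; // could be created with { unique: true }
const writeIndex = { catalog: 1, in_catalog: 1 };

// Hypothetical helper: filter for the per-location breakdown, using
// plain equality on a field instead of a regex on _id.
function locationFilter(location) {
  return { location };
}

// Hypothetical helper: builds the filter and update that reset every
// active document of a catalog that was not in the latest response.
// `seen` is an array of { location, good } pairs.
function staleResetOp(catalog, seen) {
  const filter = { catalog, in_catalog: 1 };
  // An empty $nor array is rejected by the server, so add it only when needed.
  if (seen.length > 0) {
    filter.$nor = seen.map(({ location, good }) => ({ location, good }));
  }
  return { filter, update: { $set: { in_stock: 0, in_catalog: 0 } } };
}

const op = staleResetOp("CATALOG1", [{ location: "NEWYORK", good: "MILK" }]);
console.log(JSON.stringify(op.filter));
```

With this shape, _id stays as the default ObjectId, and both the read path (by location) and the write path (by catalog) are equality matches on indexed fields.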

I also think you are artificially impacting your query performance by using a regex search. While the feature is there, MongoDB isn't built to be a full-text-search engine. If you need something like that (Google-like queries), then consider using ElasticSearch or related technologies.

Finally, should you indeed need web-scale insertion performance while maintaining query performance, you might find yourself making trade-offs elsewhere. For instance, you could insert or update the documents with no indexes, and then use the oplog to keep a "read-only" indexed version in sync. There are some really innovative architectures that marry MongoDB and ElasticSearch in this way and provide excellent performance (for instance: http://blog.mpayetta.com/elasticsearch/mongodb/2016/08/04/full-text-indexing-with-elastic-search-and-mongodb/).
