Do DocumentDB Subqueries Perform Well Enough to do RDMS Style Joins?

Question

DocumentDB has it's strengths. I think most would agree that creating associations between documents is not one of those strengths.

From what I read, the common strategy is to keep your data as denormalized as possible and custom logic around updating denormalized data when the original changes.

But what if you need your data to be normalized in some places? Lets say I have People and IceCreamFlavors. a Person has a FavorityIceCreamFlavor that is an id to an IceCreamFlavor.

From what I understand, if I need to get the IceCreamFlavor for this person, I'll need to run a second query to fetch the IceCreamFlavor.

(single collection documentdb)

SELECT * FROM c c.id = "person-1"

{
  "firstName": "John",
  "lastname": "Doe",
  "favorityIceCreamFlavor": "icecream-4"
}

Fetch IceCreamFlavor-

select * From c WHERE c.id = "icecream-4"

{
  "name": "Chocolate"
}

Combine Objects-

{
  "firstName": "John",
  "lastname": "Doe",
  "favorityIceCreamFlavor": {
     "name": "Chocolate"
   }
}

Obviously not ideal, but if i'm looking a persons profile this isn't the worst. Additionally, with this flavor of document storage (documentdb), I can create stored procedures so I can do this sub-query server side.

But what if I'm an administrator and I want to see all of my users and their favorite ice creams?

This is starting to look like a problem. It looks like I have to do 11 sub-queries to fetch each users ice cream flavor.

This may simply be a problem that document storage cannot handle efficiently. But i'm making that assumption- I don't know the how documentdb works under the hood.

Should I be concerned about documentdb doing a query for each record here in a stored procedure?

Do DocumentDB Subqueries Perform Well Enough to do RDMS Style Joins?

Larry Maccherone · Accepted Answer

A database needs to do two queries to accomplish a join. Now both of them can be in cache either for just the indexes or in some cases the entire operation. Also, that work is done in the same memory space and very close to the data in such a way that throughput constraints don't come into play.

DocumentDB/CosmosDB has something very close to that if you can do both queries in a stored procedure. You can only do that if both sets are in the same partition and they can be accomplished before the sproc times out (happens between 5K and 20K docs retrieved on large databases), but if you can use a stored procedure, then you are in the same memory space and very close to the data. The difference in latency between an SQL join and two round trips in DocumentDB/CosmosDB sproc will be minimal, single digit milliseconds on a database of 100k documents where your query only pulls back 100's of documents in my estimation.

A couple of other downsides to using sprocs for queries I should mention: 1) It can consume more RUs depending upon the complexity of your join logic, and 2) Sprocs execute in isolation which can limit concurrency and reduce the overall throughput of the system. On the other hand, you get guaranteed ACID consistency even when one of the other less strong consistency models is in effect for non-sproc queries.

If you can't use a sproc because of the reasons discussed above, then you'll need to pull the data back across the wire for the first query before composing the second one. In this case, you will run into throughput constraints and additional latency. How much depends upon a lot parameters. Using an app server in the same data center as the DocumentDB/CosmosDB partition holding the data will keep this to a minimum, but even that will still come with a penalty. It sill may be milliseconds difference depending but it will have an effect. If you have to leave the data center with the first round before composing the second, the effect will be even greater.

Every application is different but for typical OLTP traffic, I've always been able to get the performance I've needed out DocumentDB. And even heavy analytical loads can work especially if you are careful with partition key choice to get sufficient parallelism. I suggest you give it a try with a simple experiment close to your intended final product and see how it does.

Hope this helps.

Do DocumentDB Subqueries Perform Well Enough to do RDMS Style Joins?

Answers (1)

Related Questions