ltfishie

Reputation: 2987

Performance of querying relational data with document-based NoSQL (MongoDB, CouchDB, Riak, etc.)

To follow up on my question on modeling relational data with NoSQL, I have read several articles on the subject:

Nosql doesn't mean non-relational

Nosql Ecommerce Example

They seem to suggest that NoSQL can handle normalized, relational data.

So let's continue with the example I had before: a CMS that has two types of data, articles and authors, where an article has a reference (by ID) to its author.
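
For concreteness, assume the two document types are shaped roughly like this (the exact field names are just illustrative):

{ "_id": "article_1", "type": "article", "title": "...", "author_id": "author_1", "created_at": 1315051541000 }

{ "_id": "author_1", "type": "author", "name": "Author Name" }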

Below are the operations the system needs to support:

  1. Fetch an article by ID along with its author
  2. Fetch all articles by a particular author
  3. Find the first 10 articles, with their authors, sorted by creation date

I would like to understand the performance of these operations compared to the same operations if the same data were stored in an RDBMS. In particular, please specify whether each operation uses MapReduce, requires multiple trips to the NoSQL store (links), or relies on pre-joined data.

I would like to limit the discussion to document-based NoSQL solutions like MongoDB, CouchDB, and Riak.

Edit 1:

The Spring Data project is available for Riak and MongoDB.

Upvotes: 5

Views: 3410

Answers (3)

BigBlueHat

Reputation: 2365

Just wanted to toss in a CouchDB answer for anyone who might be curious. :)

As mentioned in the first answer above, embedding the author document into the article document is unwise, so the examples below assume two document types: articles and authors.

CouchDB uses MapReduce queries, typically written in JavaScript (though Python, Ruby, Erlang, and others are available). The results of a MapReduce query are stored in an index upon their first request, and that stored index is used for all future lookups. Changes to the database are folded into the index on subsequent requests.

CouchDB's API is completely HTTP-based, so all requests to the database are HTTP verbs (GET, POST, PUT, DELETE) at various URLs. I'll list the MapReduce queries (written in JavaScript) along with the URLs used to request related results from the index.
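
For reference, the view functions below live in a design document. A minimal sketch (assuming the cms design document name used in the URLs below, with the map bodies elided):

{
  "_id": "_design/cms",
  "language": "javascript",
  "views": {
    "articles_by_author": { "map": "function (doc) { ... }" },
    "articles_by_date": { "map": "function (doc) { ... }" }
  }
}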

1. Fetch an article by ID along with its author

The simplest method for doing this is two direct document lookups:

GET /db/{article_id}
GET /db/{author_id}

...where {author_id} is the value obtained from the article's author_id field.

2. Fetch all articles by a particular author

MapReduce

function (doc) {
  // Index only article documents, keyed by the author's ID
  if (doc.type === 'article') {
    emit(doc.author_id, doc);
  }
}
GET /db/_design/cms/_view/articles_by_author?key="{author_id}"

...where {author_id} is the actual ID of the author.

3. Find the first 10 articles, with their authors, sorted by creation date

MapReduce

function (doc) {
  // Convert a timestamp into an array key so the view sorts
  // chronologically and can be queried at any date granularity
  function arrayDateFromTimeStamp(ts) {
    var d = new Date(ts);
    return [d.getFullYear(), d.getMonth(), d.getDate(),
            d.getHours(), d.getMinutes(), d.getSeconds()];
  }

  if (doc.type === 'article') {
    var newdoc = doc;
    // Swap the article's own _id for the author's _id so that
    // ?include_docs=true pulls in the author document (see below)
    newdoc._id = doc.author_id;
    newdoc.created_at = arrayDateFromTimeStamp(doc.created_at);
    emit(newdoc.created_at, newdoc);
  }
}

It's possible to do include-style "joins" in CouchDB using ?include_docs=true in a view request. If you include an "_id" key in the value side of the emit (the second argument), then adding include_docs=true to your query parameters will include the doc referenced by that "_id". In the case above, we're replacing the document's own "_id" (which we don't need anymore) with the referenced author's "_id" (the value of "author_id" in the article document). Requesting the top 10 articles with their related author info looks like this:

GET /db/_design/cms/_view/articles_by_date?descending=true&limit=10&include_docs=true

Requesting that URL will return a list of the most recent 10 articles in a format similar to:

{"rows":[
  { "id":"article_id",
    "key":[2011, 9, 3, 12, 5, 41],
    "value":{"_id":"author_id", "title":"..."},
    "doc":{"_id":"author_id", "name":"Author Name"}
  }
]}

Using this same index you can get a list of all the documents at any year, month, day, hour, etc. granularity, with or without author data.
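
For example, a key-range request against the same index would fetch everything from a single month (a sketch; startkey/endkey are standard CouchDB view parameters, and {} collates after any other value):

GET /db/_design/cms/_view/articles_by_date?startkey=[2011,9]&endkey=[2011,9,{}]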

There are also methods for using view collation to aggregate several documents into a single logical result (like a page in a CMS referencing disparate content). There's some info on how to do that in these slides I did for CouchConf in July: http://www.slideshare.net/Couchbase/couchconfsfdesigningcouchbasedocuments
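
A minimal sketch of that collation pattern, assuming hypothetical page and content document types where content items carry a page_id reference:

function (doc) {
  if (doc.type === 'page') {
    // Sort key 0 collates the page itself first
    emit([doc._id, 0], doc);
  } else if (doc.type === 'content') {
    // Content items collate immediately after their page
    emit([doc.page_id, 1], doc);
  }
}

A single key-range query over [page_id] then returns the page followed by all of its content in one request.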

If you have any other questions, please let me know.

Upvotes: 5

Gates VP

Reputation: 45287

Fetch an article by ID along with its author

SQL:

  • 1 query
  • 2 index lookups
  • 2 data lookups
  • data returned = article + author

MongoDB:

  • 2 queries
  • 2 index lookups
  • 2 data lookups
  • data returned = article + author

Fetch all articles by a particular author

SQL:

  • 1 query
  • 1 index lookup
  • N data lookups
  • data returned = N articles

MongoDB:

  • 1 query
  • 1 index lookup
  • N data lookups
  • data returned = N articles

Find the first 10 articles, with their authors, sorted by creation date

SQL:

  • 1 query
  • 2 index lookups
  • 11 to 20 data lookups (articles then unique authors)
  • data returned = 10 articles + 10 authors

MongoDB:

  • 2 queries (articles.find().sort().limit(10), then authors.find({_id: {$in: article_authors}}); see the sketch after this list)
  • 2 index lookups
  • 11 to 20 data lookups (articles then unique authors)
  • data returned = 10 articles + 1 to 10 authors
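
In mongo shell terms, those two queries look roughly like this (collection and field names assumed from the example above):

// 1) The ten most recent articles
var articles = db.articles.find().sort({created_at: -1}).limit(10).toArray();

// 2) Their unique authors, fetched in a single batch
var author_ids = articles.map(function(a) { return a.author_id; });
var authors = db.authors.find({_id: {$in: author_ids}}).toArray();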

Summary

In two cases MongoDB requires an extra query, but does most of the same total work underneath. In some cases MongoDB returns less data over the network (no repeated entries). The join queries tend to be limited by the requirement that all the data to be joined live on the same box. If authors and articles live in different places, then you end up doing two queries anyway.

MongoDB tends to get better "raw" performance because it doesn't flush to disk with every write (so it's actually a "durability" tradeoff). It also has a much smaller query parser, so there's less activity per query.

From a basic performance standpoint these things are very similar. They just make different assumptions about your data and the trade-offs you want to make.

Upvotes: 4

Travis

Reputation: 10547

For MongoDB, you wouldn't use embedded documents for the author record, so the pre-join is out; it's multiple trips to the DB. However, you can cache the author and only need to make that second trip once for each author record. The queries you indicated are pretty trivial in MongoDB.

// findOne returns the document itself rather than a cursor
var article = db.articles.findOne({_id: article_id});
var author = db.authors.findOne({_id: article.author_id});

If you are using an ORM/ODM to manage your entities within your application, this would be transparent. It would be two trips to the DB, but they should be fast responses; two hits shouldn't be noticeable at all.

Finding articles by a given author is just the reverse...

// Look up the author first, then all of their articles
var author = db.authors.findOne({name: author_name});
var articles = db.articles.find({author_id: author._id});

So again, two queries, but the single author fetch should be fast and can easily be cached.
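
A hypothetical sketch of such a cache (authorCache and getAuthor are illustrative names, not a library API):

// Simple in-application cache: repeated lookups of the same
// author skip the second round trip to the database
var authorCache = {};

function getAuthor(author_id) {
  if (!authorCache[author_id]) {
    authorCache[author_id] = db.authors.findOne({_id: author_id});
  }
  return authorCache[author_id];
}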

For the third operation, grab the latest articles and then batch-fetch their authors:

var articles = db.articles.find({}).sort({created_at: -1}).limit(10); // newest first
var author_ids = articles.map(function(a) { return a.author_id; });
var authors = db.authors.find({_id: {$in: author_ids}});

Lastly, again, two queries but just a tiny bit more complex. You can run these in a mongo shell to see what the results might be like.

I'm not sure this is worth writing a MapReduce job for. A couple of quick round trips might add a little latency, but the mongo wire protocol is pretty fast. I wouldn't be overly worried about it.

As for the real performance implications of doing it this way: since ideally you'd only be querying on indexed fields in the document, it should be pretty quick. The only additional step is a second round trip to get the other documents, and depending on how your application and DB are structured, this is likely not a big deal at all. You can tell mongo to profile only queries that take over a given threshold (100 or 200 ms by default when turned on), so you can keep an eye on what's taking time for your program as data grows.
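
For example, in the mongo shell (db.setProfilingLevel is the standard helper; the 100 ms threshold here is just an illustration):

// Record only operations slower than 100 ms
db.setProfilingLevel(1, 100);

// Review the slowest recent operations
db.system.profile.find().sort({ts: -1}).limit(5);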

The one benefit you have here that an RDBMS does not offer is much easier breaking apart of data. What happens when you expand your application beyond the CMS to support other things but use the same authentication store? It just happens to be a completely separate DB now, shared across many applications. It's much simpler to perform these queries across DBs; with RDBMS stores it's a complex process.

I hope this helps you in your NoSQL discovery!

Upvotes: 2
