Reputation: 21

MongoDB design for scalability

We want to design a scalable database. If we have N users with 1 billion user responses, which of the two options below is the better design? We want to query by user ID as well as by response ID.

Option 1: two collections, one for the user information and another for the responses along with the user ID. Each response is stored as a document, so we will have 1 billion documents.

    User Collection
    {
      "userid" : "userid1",
      "password" : "xyz",
      ,
      "City" : "New York",
    },
    {
      "userid" : "userid2",
      "password" : "abc",
      ,
      "City" : "New York",
    }


    responses Collection
    {
      "userid": "userid1",
      "responseID": "responseID1",
      "response" : "xyz"
    },
    {
      "userid": "userid1",
      "responseID": "responseID2",
      "response" : "abc"
    },
    {
      "userid": "userid2",
      "responseID": "responseID3",
      "response"  : "mno"
    }

Option 2: one collection that stores both kinds of information, as below. Each response is represented by a new key (responseIDX).

    {
      "userid" : "userid1",
      "responseID1" : "xyz",
      "responseID2" : "abc",
      ,
      "responseN" : "mno",
      "city" : "New York"
    }

Upvotes: 2

Answers (2)

kmfk

Reputation: 3961

Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.

Embedded documents can be a boon to your schema design, but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). The reason is document growth: when a document outgrows the space allocated for it on disk, MongoDB must move it to a new location to accommodate the new size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.

Also, querying on those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users: you cannot return only the matching embedded documents. Using the positional operator, however, it is possible to get the first matching embedded document.
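
To make the positional-operator limitation concrete, here is a minimal Python sketch that mimics it in memory. It assumes the responses are stored as an array of embedded documents under a `responses` field, which is an assumption and not one of the layouts in the question; `project_first_match` is a hypothetical helper, not a MongoDB or driver function. In MongoDB itself the comparable query would be a filter such as `{"responses.response": "abc"}` with the projection `{"responses.$": 1}`.

```python
def project_first_match(doc, field, predicate):
    """Mimic the positional ($) projection: keep only the FIRST matching array element."""
    for item in doc.get(field, []):
        if predicate(item):
            # Copy the document, replacing the array with the single matched element.
            return {**doc, field: [item]}
    return None  # the document does not match the query at all


# Assumed layout: one user document with an array of embedded responses.
user = {
    "userid": "userid1",
    "responses": [
        {"responseID": "responseID1", "response": "abc"},
        {"responseID": "responseID2", "response": "xyz"},
        {"responseID": "responseID3", "response": "abc"},
    ],
}

result = project_first_match(user, "responses", lambda r: r["response"] == "abc")
print(result["responses"])
```

Two embedded responses match here, but only `responseID1` comes back, which is the behavior that makes selective retrieval awkward with this layout.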

So, I would recommend using a separate collection for the responses.
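
As a rough illustration of the two lookups the question asks for (by `userid` and by `responseID`) against a separate responses collection, here is a minimal Python sketch. It uses a plain list of dictionaries as an in-memory stand-in so it runs without a server; the `find` helper is hypothetical and only mimics equality matching. With a driver such as pymongo, the same filter dictionaries could be passed to `Collection.find()`.

```python
# In-memory stand-in for the responses collection from the question.
responses = [
    {"userid": "userid1", "responseID": "responseID1", "response": "xyz"},
    {"userid": "userid1", "responseID": "responseID2", "response": "abc"},
    {"userid": "userid2", "responseID": "responseID3", "response": "mno"},
]


def find(collection, query):
    """Hypothetical helper: return documents whose fields equal every key in query."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in query.items())]


# All responses for one user, then a single response by its ID.
by_user = find(responses, {"userid": "userid1"})
by_response = find(responses, {"responseID": "responseID3"})

print([doc["responseID"] for doc in by_user])
print(by_response[0]["response"])
```

On a real deployment each of these filters would want its own index (for example one on `userid` and one on `responseID`); otherwise every lookup has to scan the whole collection.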

Though, as mentioned above, I would also suggest experimenting with other ways to group the responses in that collection: a document per day, per user, or per whatever other dimensions you might have.

Group them in ways that allow multiple embedded documents and complement how you would query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
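
One possible grouping, sketched in Python under stated assumptions: a bucket document per user per day, with each new response appended to that bucket. The field names `day`, `responses`, and `count`, and the helper `bucket_upsert`, are made up for illustration. The function only builds the filter and update documents, so it runs without a server.

```python
from datetime import datetime, timezone


def bucket_upsert(userid, response_id, response, when):
    """Build (filter, update) for appending a response to a per-user, per-day bucket."""
    day = when.strftime("%Y-%m-%d")
    # One bucket document is identified by the user and the calendar day.
    filter_doc = {"userid": userid, "day": day}
    update_doc = {
        # Append the response to the bucket's embedded array.
        "$push": {"responses": {"responseID": response_id, "response": response}},
        # Track how many responses the bucket holds.
        "$inc": {"count": 1},
    }
    return filter_doc, update_doc


flt, upd = bucket_upsert("userid1", "responseID1", "xyz",
                         datetime(2013, 5, 1, 12, 0, tzinfo=timezone.utc))
print(flt)
print(upd)

# With pymongo, this pair could be sent as:
#   collection.update_one(flt, upd, upsert=True)
```

Because each bucket is capped at one day of one user's responses, document growth stays bounded, at the cost of having to pick a bucket size that matches how the data is queried.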

Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any type of aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, the total response count for a user pre-aggregated is far more convenient than trying to get a count by running an aggregate query on the full dataset.
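
A minimal sketch of such a counter in Python, assuming a `responseCount` field on the user document (the field name is made up for illustration). `apply_inc` is a hypothetical in-memory imitation of MongoDB's `$inc` update operator so the example runs without a server.

```python
def apply_inc(doc, update):
    """Hypothetical in-memory version of $inc: add each amount to the named field."""
    for field, amount in update.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + amount
    return doc


user = {"userid": "userid1", "City": "New York"}

# Update document issued every time the user submits a response.
record_response = {"$inc": {"responseCount": 1}}

for _ in range(3):  # three responses arrive
    apply_inc(user, record_response)

print(user["responseCount"])

# With pymongo, the same update could be sent as:
#   users.update_one({"userid": "userid1"}, record_response)
```

Reading the total is then a single-document lookup on the user instead of a count over the full responses collection.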

Upvotes: 1

Patrick Lorio

Reputation: 5668

If you use your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage.

    {
      "userId": n,
      "city": "foo",
      "responses": {
        "responseId1": "response message 1",
        "responseId2": "response message 2"
      }
    }

As for which would give better performance, run a few benchmark tests.

Upvotes: 1
