Lloyd
Lloyd

Reputation: 1403

MongoDB data structure with large number internal documents

I am relatively new to MongoDB, and so far am really impressed. I am struggling with the best way to setup my document stores though. I am trying to do some summary analytics using twitter data and I am not sure whether to put the tweets into the user document, or to keep those as a separate collection. It seems like putting the tweets inside the user model would quickly hit the limit with regards to size. If that is the case then what is a good way to be able to run MapReduce across a group of user's tweets?

I hope I am not being too vague but I don't want to get too specific and too far down the wrong path as far as setting up my domain model.

As I am sure you are all bored of hearing, I am used to RDB land where I would lay out my schema like

| USER |
--------
|ID
|Name
|Etc.

|TWEET__|
---------
|ID
|UserID
|Etc

It seems like the logical schema in Mongo would be

User
|-Tweet (0..3000)
  |-Entities
    |-Hashtags (0..10+)
    |-urls (0..5)
    |-user_mentions (0..12)
  |-GeoData (0..20)
|-somegroupID

but wouldn't that quickly bloat the User document beyond capacity. But I would like to run analysis on tweets belonging to users with similar somegroupID. It conceptually makes sense to to the model layout as above, but at what point is that too unweildy? And what are viable alternatives?

Upvotes: 0

Views: 850

Answers (2)

Lloyd
Lloyd

Reputation: 1403

All credit to the fine folks at MongoHQ.com. My question was answered over on https://groups.google.com/d/msg/mongodb-user/OtEOD5Kt4sI/qQg68aJH4VIJ

Chris Winslett @ MongoHQ


You will find this video interesting:

http://www.10gen.com/presentations/mongosv-2011/schema-design-at-scale

Essentially, in one document, store one days of tweets for one person. The reasoning:

  • Querying typically consists of days and users

Therefore, you can have the following index:

{user_id: 1, date: 1} # Date needs to be last because you will range and sort on the date

Have fun!

Chris MongoHQ


I think it makes the most sense to implement the following:

user

{ user_id: 123123,
  screen_name: 'cledwyn',
  misc_bits: {...},
  groups: [123123_group_tall_people, 123123_group_techies, ],
  groups_in: [123123_group_tall_people]
}

tweet

{ tweet_id: 98798798798987987987987,
  user_id: 123123,
  tweet_date: 20120220,
  text: 'MongoDB is pretty sweet',
  misc_bits: {...},
  groups_in: [123123_group_tall_people]
}

Upvotes: 1

Derick
Derick

Reputation: 36774

You're right that you'll probably run into the 16MB MongoDB document limit here. You are not saying what sort of analysis you'd like to run, so it is difficult to recommend a schema. MongoDB schemas are designed with the data-query (and insertion) patterns in mind.

Instead of putting your tweets in a user, you can of course quite easily do the opposite, add a user-id and group-id into the tweet documents itself. Then, if you need additional fields from the user, you can always pull that in a second query upon display.

I mean a design for a tweet document as:

{
    'hashtags': [ '#foo', '#bar' ],
    'urls': [ "http://url1.example.com", 'http://url2.example.com' ],
    'user_mentions' : [ 'queen_uk' ],
    'geodata': { ... },
    'userid': 'derickr',
    'somegroupid' : 40
}

And then for a user collection, the documents could look like:

{
    'userid' : 'derickr',
    'realname' : Derick Rethans',
    ...
}

Upvotes: 1

Related Questions