codingintherain

Reputation: 263

Efficient DynamoDB schema for time series data

We are building a conversation system that will support messages between 2 users (and eventually between 3+ users). Each conversation will have a collection of users who can participate/view the conversation as well as a collection of messages. The UI will display the most recent 10 messages in a specific conversation with the ability to "page" (progressive scrolling?) the messages to view messages further back in time.

The plan is to store conversations and the participants in MSSQL and then only store the messages (which represents the data that has the potential to grow very large) in DynamoDB. The message table would use the conversation ID as the hash key and the message CreateDate as the range key. The conversation ID could be anything at this point (integer, GUID, etc) to ensure an even message distribution across the partitions.

In order to avoid hot partitions one suggestion is to create separate tables for time series data because typically only the most recent data will be accessed. Would this lead to issues when we need to pull back previous messages for a user as they scroll/page because we have to query across multiple tables to piece together a batch of messages?

Is there a different/better approach for storing time series data that may be infrequently accessed, but available quickly?

Upvotes: 1

Views: 911

Answers (1)

Zach Moshe

Reputation: 2980

I guess we can assume that there are many "active" conversations in parallel, right? Meaning - we're not dealing with the case where all the traffic is regarding a single conversation (or a few).

If that's the case, and you're using a random number/GUID as your HASH key, your items will be spread evenly across the partitions, and as far as I know you shouldn't need to worry about skew. Since CreateDate is only the RANGE key, all messages for the same conversation are stored together (placed by their ConversationID) and kept sorted by CreateDate, so it doesn't matter whether you query for the latest 5 records or the earliest 5. In both cases it's a single query that reads along the CreateDate sort order.
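To make the key layout concrete, here is a minimal sketch of the parameters you might pass to the low-level `create_table` call. The table name `Messages`, the string attribute types, and on-demand billing are assumptions for illustration, not something from the question; CreateDate is assumed to be stored as an ISO-8601 string so that it sorts chronologically.

```python
from typing import Any, Dict


def message_table_definition(table_name: str = "Messages") -> Dict[str, Any]:
    """Build create_table parameters for the message table (illustrative only)."""
    return {
        "TableName": table_name,  # hypothetical name
        "AttributeDefinitions": [
            # Assumption: both keys are stored as strings.
            {"AttributeName": "ConversationID", "AttributeType": "S"},
            {"AttributeName": "CreateDate", "AttributeType": "S"},
        ],
        "KeySchema": [
            # HASH key decides which partition a conversation lands in.
            {"AttributeName": "ConversationID", "KeyType": "HASH"},
            # RANGE key keeps a conversation's messages sorted by time.
            {"AttributeName": "CreateDate", "KeyType": "RANGE"},
        ],
        # Assumption: on-demand billing, to keep the sketch short.
        "BillingMode": "PAY_PER_REQUEST",
    }
```

With boto3 this dict could be passed as `client.create_table(**message_table_definition())`. One thing to watch: two messages in the same conversation with an identical CreateDate would share a primary key, so you may want a high-resolution timestamp or a suffix that makes each value unique.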

I wouldn't break the data into multiple tables. I don't see what benefit it gives you (considering the previous section) and it will make your administrative life a nightmare (just imagine changing throughput for all tables, or backing them up, or creating a CloudFormation template to create your whole environment).

I would be concerned with the number of messages that will be returned when you pull the history. I guess you'll implement that with a Query operation using the ConversationID as the HASH key and ordering the results by CreateDate descending. In that case, I'd return only the first page of results (a single Query call returns up to 1 MB of data, so depending on the average message length that may or may not be enough) and fetch the next page only if the user keeps scrolling. Otherwise you might spend a lot of your throughput on really long conversations, and the client doesn't want to get stuck waiting for megabytes of data to appear on screen anyway.
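A rough sketch of that page-at-a-time query, under a few assumptions: `client` is a boto3 low-level DynamoDB client (or anything with a compatible `query` method), the table is named `Messages`, and ConversationID is stored as a string. The function name and the page size of 10 are just illustrative.

```python
from typing import Any, Dict, List, Optional, Tuple

PAGE_SIZE = 10  # assumption: matches the 10 messages the UI shows


def fetch_message_page(
    client: Any,
    table_name: str,
    conversation_id: str,
    start_key: Optional[Dict[str, Any]] = None,
    page_size: int = PAGE_SIZE,
) -> Tuple[List[Dict[str, Any]], Optional[Dict[str, Any]]]:
    """Fetch one page of a conversation's messages, newest first.

    Returns the items plus the key to pass back in for the next (older)
    page, or None when there is nothing further to read.
    """
    params: Dict[str, Any] = {
        "TableName": table_name,
        "KeyConditionExpression": "ConversationID = :cid",
        "ExpressionAttributeValues": {":cid": {"S": conversation_id}},
        "ScanIndexForward": False,  # descending by the RANGE key (CreateDate)
        "Limit": page_size,         # stop after one screenful of items
    }
    if start_key is not None:
        # Resume right after the last item of the previous page.
        params["ExclusiveStartKey"] = start_key
    response = client.query(**params)
    return response.get("Items", []), response.get("LastEvaluatedKey")
```

The first call passes no `start_key`; when the user scrolls further back, pass the key returned by the previous call. Keeping `Limit` small means each scroll reads only about one screenful of items instead of up to 1 MB.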

Hope this helps

Upvotes: 1
