Reputation: 79
Moving from RDBMS and I am not sure how best to design for below scenario
I have a table with around 200,000 questions with question id as partition key.
Users view questions and i do not wish to show viewed question again to the user. So which one is better option?
1 and 2 might have a problem with the 400 kb size limit for item. The third seems better option though i would end up with 100 million items as there will be one row per user per question viewed. but i assume this is not a problem for dynamo?
Another problem is how to get 10 random questions not viewed by the user. Do i generate 10 random numbers between 1 and 200,000 (the number of questions) and then check if not in table mentioned in point 3 above?
Upvotes: 2
Views: 84
Reputation: 751
I definitely would not go with option 1 or 2 for the reason you mentioned: you would already be limiting your scalability by the 400kb limit. With a UUID of 128 bits, you would be limited to about 250 users per question.
Option 3 is the way to go with DynamoDB, but what you need to consider is what is the partition key and what is the range key. You could have user_id as the partition key and question_id as the range key. The answer to that decision depends on how your data is going to be accessed. DynamoDB divides the total table throughput by each partition key: each one of your n partition keys gets 1/nth of the table throughput. For example, if you have a subset of partition keys that are accessed more than the others, then you won't be efficiently utilizing your table throughput because those partition keys that actually use up less than 1/nth of the throughput are still provisioned for 1/nth of the throughput. The general idea is that you want to have the each of your partition keys utilized equally. I think that you have it correct, I'm assuming that each question is given randomly and is no more popular than another, while some users might be more active than others.
The other part of your question is a little bit more difficult to answer / determine. You could do it your way where you have tables that contain question and user pairs for the questions that those users have read or you could have tables that contain the pairs for the questions those users haven't read. The tradeoff here is between initial write cost and subsequent read cost, and the answer depends on the amount of questions that you have compared to the consumption rate.
When you have a large amount of questions compared to the rate that users will progress through them, the chances of randomly selecting an already chosen one are small, so you're going to want to store have-read question-user pairs. With this setup you don't pay a lot to initialize a user (you don't have to write a question-user pair for each question) and you won't have a lot of miss-read costs (i.e. where you select a question-user pair and it turns out they already read it, this still consumes read-write units).
If you have a small amount of questions compared to the rate that users consume them, then you're going to want to store the haven't-read question-user pairs. You pay something to initialize each user (writing in one question-user pair for each question), but then you don't have any accidental miss-reads. If you stored them as have-read pairs when their is a small amount of questions, then you will encounter a lot of miss-reads as the percentage of read questions approaches 100% (to the point where you would have been better off just setting them up as haven't-read pairs).
I hope this helps with your design considerations. Drop a comment if you need clarification!
Upvotes: 1