Patrick D

Reputation: 13

What is an appropriate database solution for high throughput updates?

Suppose I am a subscription service and I have a table with each row representing customer data.

I want to build a system that consumes a daily snapshot of customer data. This daily snapshot contains data for all currently existing customers (i.e., there will be rows for new customers, and customers who unsubscribed will not appear). I also need to keep track of how long each customer has been subscribed, using start and end times. If a customer re-subscribes, another start/end entry is recorded for that customer. A sample record/schema is shown below.

{
    "CustomerId": "12345",
    "CustomerName": "Bob",
    "MagazineName": "DatabaseBoys",
    "Gender": "Male",
    "Address": "{\"streetName\": \"Sesame Street\", ...}",
    "SubscriptionTimeRanges": [{"start": 12345678, "end": 23456789}, {"start": 34567890, "end": 45678901}, ...]
}

I know that DynamoDB would be quick at handling this kind of use case, and the record schema is right up the NoSQL alley. I could use global secondary indexes / local secondary indexes to resolve some of my issues. I have some experience with PostgreSQL from using Redshift, but there I mostly dealt with bulk inserts and had no need for data modification. Now I need the data modification aspect. I'm thinking RDS Postgres would be nice for this, but I would love to hear your thoughts or opinions.
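To make the data-modification part concrete, here is roughly the kind of daily upsert I have in mind on the Postgres side. It's only a sketch: the table names, columns, and the idea of splitting subscription ranges into their own table are my own placeholders, not a settled design.

# Rough sketch only -- table names, columns, and connection details are placeholders.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS customers (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT,
    magazine_name TEXT,
    gender        TEXT,
    address       TEXT
);

CREATE TABLE IF NOT EXISTS subscription_periods (
    customer_id TEXT REFERENCES customers (customer_id),
    start_ts    BIGINT NOT NULL,
    end_ts      BIGINT,              -- NULL while the subscription is still active
    PRIMARY KEY (customer_id, start_ts)
);
"""

def apply_snapshot(conn, snapshot, snapshot_ts):
    """Apply one daily snapshot (a list of records shaped like the sample above)."""
    with conn.cursor() as cur:
        cur.execute(DDL)

        for record in snapshot:
            # Upsert the customer's attributes from today's snapshot.
            cur.execute(
                """
                INSERT INTO customers (customer_id, customer_name, magazine_name, gender, address)
                VALUES (%s, %s, %s, %s, %s)
                ON CONFLICT (customer_id) DO UPDATE
                    SET customer_name = EXCLUDED.customer_name,
                        address       = EXCLUDED.address
                """,
                (record["CustomerId"], record["CustomerName"], record["MagazineName"],
                 record["Gender"], record["Address"]),
            )
            # Open a new subscription period unless one is already open
            # (covers both brand-new customers and re-subscribers).
            cur.execute(
                """
                INSERT INTO subscription_periods (customer_id, start_ts, end_ts)
                SELECT %s, %s, NULL
                WHERE NOT EXISTS (
                    SELECT 1 FROM subscription_periods
                    WHERE customer_id = %s AND end_ts IS NULL
                )
                """,
                (record["CustomerId"], snapshot_ts, record["CustomerId"]),
            )

        # Anyone missing from today's snapshot has unsubscribed: close their open period.
        cur.execute(
            """
            UPDATE subscription_periods
            SET end_ts = %s
            WHERE end_ts IS NULL
              AND NOT (customer_id = ANY(%s))
            """,
            (snapshot_ts, [r["CustomerId"] for r in snapshot]),
        )
    conn.commit()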

P.S. Don't take the "subscription" system design too seriously; it's the best parallel I could think of for this kind of requirement. :)

Upvotes: 0

Views: 1132

Answers (2)

John Rotenstein

Reputation: 269282

250,000 rows of data processed daily probably does not justify the use of Amazon Redshift. Its sweet spot is millions to billions of rows, and it is typically used when you want to run queries throughout the day.

If an RDS database suits your needs, then go for it! If you wish to save cost, you could accumulate the records in Amazon S3 throughout the day and then just load and process the data once per day, turning off the database when it isn't required. (Or even terminate it and launch a new one the next day, since it seems that you don't need to access historical data.)

Amazon Athena might even suit your needs, reading the daily data directly from S3 without needing a persistent database at all.
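For example, a daily query over the S3 data could be as simple as the following sketch (the bucket, database, and table names are made up):

# Rough sketch only -- bucket, database, and table names are made up.
import time
import boto3

athena = boto3.client("athena")

def run_query(sql):
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "subscriptions"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Athena is asynchronous, so poll until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    return athena.get_query_results(QueryExecutionId=query_id)

# Example: count the customers present in one day's snapshot.
results = run_query(
    "SELECT COUNT(*) FROM daily_snapshot WHERE snapshot_date = DATE '2020-01-01'"
)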

Upvotes: 0

Michael - sqlbot

Reputation: 179054

This is a subjective question, but objectively speaking, DynamoDB is not designed for scans. It can do them, but scanning requires making repeated requests in a loop, with each request starting where the last one left off. This isn't quick for large data sets. There is also parallel scan, but you have to juggle the threads yourself, and it consumes a lot of table throughput.
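For illustration, the scan loop looks roughly like this with boto3 (the table name is made up):

# Rough sketch only -- the table name is made up.
import boto3

table = boto3.resource("dynamodb").Table("Customers")

def scan_all_items():
    items = []
    kwargs = {}
    while True:
        # Each Scan call returns at most 1 MB of data; keep going from
        # LastEvaluatedKey until the whole table has been read.
        response = table.scan(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return items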

On the flip side, it is easy and inexpensive to prototype and test against DynamoDB using the SDKs.

But with the daily need to scan the data, and potential need for joins, I would be strongly inclined to go with a relational database.

Upvotes: 1
