Patrick D

Reputation: 13

What is an appropriate database solution for high throughput updates?

Suppose I am a subscription service and I have a table with each row representing customer data.

I want to build a system that consumes a daily snapshot of customer data. This daily snapshot contains data for all currently existing customers (i.e., there will be rows for new customers, and customers who unsubscribed will not appear). I also need to keep track of how long each customer has been subscribed, using start and end times. If a customer re-subscribes, another start/end entry is recorded for that customer. A sample record/schema is shown below.

{
    "CustomerId": "12345",
    "CustomerName": "Bob",
    "MagazineName": "DatabaseBoys",
    "Gender": "Male",
    "Address": "{\"streetName\": \"Sesame Street\", ...}",
    "SubscriptionTimeRanges": [{"start": 12345678, "end": 23456789}, {"start": 34567890, "end": 45678901}, ...]
}

I know that DynamoDB would be quick at handling this kind of use case, and the record schema is right up the NoSQL alley. I could use global secondary indexes / local secondary indexes to resolve some of my issues. I have some experience with PostgreSQL from using Redshift, but there I mostly dealt with bulk inserts and had no need for data modification. Now I need the data modification aspect. I'm thinking RDS Postgres would be nice for this, but I would love to hear your thoughts or opinions.
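To make the data-modification part concrete, here is roughly the kind of daily upsert I have in mind on the Postgres side. It's only a sketch: the table names, columns, and the idea of splitting subscription ranges into their own table are my own placeholders, not a settled design.

# Rough sketch only -- table names, columns, and connection details are placeholders.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS customers (
    customer_id   TEXT PRIMARY KEY,
    customer_name TEXT,
    magazine_name TEXT,
    gender        TEXT,
    address       TEXT
);

CREATE TABLE IF NOT EXISTS subscription_periods (
    customer_id TEXT REFERENCES customers (customer_id),
    start_ts    BIGINT NOT NULL,
    end_ts      BIGINT,              -- NULL while the subscription is still active
    PRIMARY KEY (customer_id, start_ts)
);
"""

def apply_snapshot(conn, snapshot, snapshot_ts):
    """Apply one daily snapshot (a list of records shaped like the sample above)."""
    with conn.cursor() as cur:
        cur.execute(DDL)

        for record in snapshot:
            # Upsert the customer's attributes from today's snapshot.
            cur.execute(
                """
                INSERT INTO customers (customer_id, customer_name, magazine_name, gender, address)
                VALUES (%s, %s, %s, %s, %s)
                ON CONFLICT (customer_id) DO UPDATE
                    SET customer_name = EXCLUDED.customer_name,
                        address       = EXCLUDED.address
                """,
                (record["CustomerId"], record["CustomerName"], record["MagazineName"],
                 record["Gender"], record["Address"]),
            )
            # Open a new subscription period unless one is already open
            # (covers both brand-new customers and re-subscribers).
            cur.execute(
                """
                INSERT INTO subscription_periods (customer_id, start_ts, end_ts)
                SELECT %s, %s, NULL
                WHERE NOT EXISTS (
                    SELECT 1 FROM subscription_periods
                    WHERE customer_id = %s AND end_ts IS NULL
                )
                """,
                (record["CustomerId"], snapshot_ts, record["CustomerId"]),
            )

        # Anyone missing from today's snapshot has unsubscribed: close their open period.
        cur.execute(
            """
            UPDATE subscription_periods
            SET end_ts = %s
            WHERE end_ts IS NULL
              AND NOT (customer_id = ANY(%s))
            """,
            (snapshot_ts, [r["CustomerId"] for r in snapshot]),
        )
    conn.commit()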

P.S. Don't take the "subscription" system design too seriously; it's the best parallel I could think of for this kind of requirement. :)

Upvotes: 0

Views: 1132

Answers (2)

John Rotenstein

Reputation: 269282

250,000 rows of data processed daily probably does not justify the use of Amazon Redshift. Its sweet spot is millions to billions of rows, and it is typically used when you want to run queries throughout the day.

If an RDS database suits your needs, then go for it! If you wish to save cost, you could accumulate the records in Amazon S3 throughout the day and then just load and process the data once per day, turning off the database when it isn't required. (Or even terminate it and launch a new one the next day, since it seems that you don't need to access historical data.)

Amazon Athena might even suit your needs, reading the daily data directly from S3 without needing a persistent database at all.
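For example, a daily query over the S3 data could be as simple as the following sketch (the bucket, database, and table names are made up):

# Rough sketch only -- bucket, database, and table names are made up.
import time
import boto3

athena = boto3.client("athena")

def run_query(sql):
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "subscriptions"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Athena is asynchronous, so poll until the query finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    return athena.get_query_results(QueryExecutionId=query_id)

# Example: count the customers present in one day's snapshot.
results = run_query(
    "SELECT COUNT(*) FROM daily_snapshot WHERE snapshot_date = DATE '2020-01-01'"
)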

Upvotes: 0

Michael - sqlbot

Reputation: 179054

This is a subjective question, but objectively speaking, DynamoDB is not designed for scans. It can do them, but scanning requires making repeated requests in a loop, with each request starting where the last one left off. This isn't quick for large data sets. There is also parallel scan, but you have to juggle the threads yourself, and it consumes a lot of table throughput.
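For illustration, the scan loop looks roughly like this with boto3 (the table name is made up):

# Rough sketch only -- the table name is made up.
import boto3

table = boto3.resource("dynamodb").Table("Customers")

def scan_all_items():
    items = []
    kwargs = {}
    while True:
        # Each Scan call returns at most 1 MB of data; keep going from
        # LastEvaluatedKey until the whole table has been read.
        response = table.scan(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return items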

On the flip side, it is easy and inexpensive to prototype and test against DynamoDB using the SDKs.

But with the daily need to scan the data, and potential need for joins, I would be strongly inclined to go with a relational database.

Upvotes: 1
