azekirel555

Reputation: 587

Inserting large amount of data into Redshift

I'm looking to build a RESTful API in Go that would be in charge of inserting the data sent by multiple mobile apps into an Amazon Redshift cluster, possibly receiving tens of thousands of requests per second.

From what I have read, Redshift gives slow insert speeds. That's why a few people have advised me to use an intermediate store like DynamoDB or S3, in which I'd perform the inserts first. Then, in a second step, I'd import the data into Redshift.

I'm wondering why I would need to use Redshift in that case, as the data would already be stored in a database. Do you think I can proceed differently?

I have also thought of a simpler solution: writing to a queue and progressively inserting the data into Redshift. But I think it might become a problem if the queue keeps growing, since the insert speed isn't fast enough to keep up with the incoming data.
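For illustration, here is a minimal sketch of that queue idea in Go, assuming an in-process channel as the queue and a placeholder flush function standing in for the actual Redshift insert; the batch size and flush interval are arbitrary:

```go
// Minimal sketch: handlers push rows onto a channel, and one goroutine
// flushes them to Redshift in batches, by size or by time.
package main

import (
	"fmt"
	"time"
)

func batcher(queue <-chan string, flush func([]string)) {
	batch := make([]string, 0, 1000) // flush every 1000 rows...
	ticker := time.NewTicker(5 * time.Second)
	for {
		select {
		case row := <-queue:
			batch = append(batch, row)
			if len(batch) == cap(batch) {
				flush(batch)
				batch = batch[:0]
			}
		case <-ticker.C: // ...or every 5 seconds, whichever comes first
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
		}
	}
}

func main() {
	queue := make(chan string, 10000)
	go batcher(queue, func(rows []string) {
		// Placeholder: a real implementation would run a multi-row INSERT
		// or stage the rows to S3 for a COPY.
		fmt.Printf("would insert %d rows into Redshift here\n", len(rows))
	})
	queue <- `{"event":"tap"}`
	time.Sleep(6 * time.Second)
}
```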

Thanks in advance for your help! :-)

Upvotes: 0

Views: 1524

Answers (1)

John Rotenstein

Reputation: 270039

Advice like this is normally off-topic for Stack Overflow, but...

Amazon Redshift is a massively parallel processing (MPP) database with an SQL interface. It can be used to query TBs and even PBs of data, and it can do so very efficiently.

You ask "why would I need to use Redshift" -- the answer is if your querying requirements cannot be satisfied with a traditional database. If you can satisfactorily use a normal database for your queries, then there's no real reason to use Redshift.

However, if your queries need Redshift, then you should continue to use it. Redshift is designed so that the most efficient way to insert data is to load it from Amazon S3 via the COPY command. Inserting data via normal INSERT statements is inefficient unless each statement inserts many rows (e.g. hundreds or thousands).
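As an illustration, here is a minimal Go sketch of issuing such a COPY, assuming the github.com/lib/pq driver (Redshift speaks the PostgreSQL wire protocol); the connection string, table name, S3 path, and IAM role are all placeholders:

```go
// Sketch: load a staged batch of files from S3 into Redshift with COPY.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Redshift is reachable via the Postgres driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgres://user:pass@my-cluster.example.redshift.amazonaws.com:5439/mydb?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// COPY loads all files under the S3 prefix in parallel across the
	// cluster's nodes, which is far faster than row-by-row INSERTs.
	_, err = db.Exec(`
		COPY events
		FROM 's3://my-bucket/events/batch-0001/'
		IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
		FORMAT AS JSON 'auto'`)
	if err != nil {
		log.Fatal(err)
	}
}
```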

So, some questions to ask:

  • Do I need the capabilities of Amazon Redshift for my queries, or can a traditional database suffice?
  • Do I need to load data in real-time, or is it sufficient to load in batches?
  • If using batches, how often do I need to load the batch? Can I do it hourly or daily, or does it need to be within a few minutes of the data arriving?

You could also consider using Amazon Kinesis Firehose, which can accept a stream of data and insert it into an Amazon Redshift database automatically.
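For example, the API handler could forward each incoming event to Firehose using the AWS SDK for Go, and Firehose would buffer the records and run the Redshift COPY on your behalf. This is a minimal sketch; the region and the delivery stream name ("events-to-redshift") are assumptions:

```go
// Sketch: send one event to a Kinesis Firehose delivery stream that is
// configured to deliver into Amazon Redshift.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/firehose"
)

func main() {
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	fh := firehose.New(sess)

	// Firehose-to-Redshift delivery expects newline-delimited records,
	// hence the trailing "\n".
	_, err := fh.PutRecord(&firehose.PutRecordInput{
		DeliveryStreamName: aws.String("events-to-redshift"), // placeholder name
		Record:             &firehose.Record{Data: []byte(`{"user":"abc","event":"tap"}` + "\n")},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```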

Upvotes: 2
