VB_

Reputation: 45692

Lambda architecture on AWS: choosing a database for the batch layer

We're building a Lambda architecture on the AWS stack. A lack of DevOps knowledge forces us to prefer AWS managed solutions over custom deployments.

Our workflow:

[Batch layer]
Kinesis Firehose -> S3 -Glue-> EMR (Spark) -Glue-> S3 views ------+
                                                                  |===> Serving layer (ECS) => Users
Kinesis -> EMR (Spark Streaming) -> DynamoDB/ElastiCache views ---+
[Speed layer]

We are already using 3 datastores: ElastiCache, DynamoDB, and S3 (queried with Athena). The batch layer produces from 500,000 up to 6,000,000 rows each hour. Only the last hour's results need to be queried by the serving layer with low-latency random reads.

None of our databases fits the batch-insert & random-read requirements. DynamoDB doesn't fit batch inserts: the write throughput required to load the batch output makes it too expensive. Athena is an MPP query engine and, moreover, is limited to 20 concurrent queries. ElastiCache is used by the streaming layer, and we're not sure it's a good idea to perform batch inserts there as well.
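
A back-of-the-envelope estimate of the DynamoDB write throughput involved (assuming 1 KB items, which is a guess on our part, not a measured number):

    # Rough DynamoDB write-throughput estimate for the batch load.
    # Assumption: items are <= 1 KB, so one row costs one write capacity unit (WCU).
    rows_per_hour = 6_000_000      # worst-case batch output per hour (from above)
    seconds_per_hour = 3_600

    wcu_sustained = rows_per_hour / seconds_per_hour
    print(f"~{wcu_sustained:.0f} WCU sustained")  # ~1667 WCU

So even if the load is spread perfectly evenly over the hour, that is a substantial provisioned write capacity for data that is only served for one hour.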

Should we introduce a fourth storage solution or stay with the existing ones?

Considered options:

  1. Persist the batch output to DynamoDB and ElastiCache (the part of the data that is updated rarely and can be compressed/aggregated goes to DynamoDB; frequently updated data, ~8 GB/day, goes to ElastiCache).
  2. Introduce another database (HBase on EMR over S3? Amazon Redshift?) as a solution.
  3. Use S3 Select over Parquet to get around Athena's concurrent query limit. That would also reduce query latency. But does S3 Select have any concurrent query limits? I can't find any related info. (A sketch of this option follows the list.)
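
For option 3, this is roughly what a serving-layer lookup would look like with boto3 (the bucket, key, and column names below are placeholders, not from our actual setup). Note that S3 Select runs against a single object, so the serving layer would have to know which object holds the last hour's output:

    import boto3

    s3 = boto3.client("s3")

    # S3 Select filters a single Parquet object server-side and returns
    # only the matching rows; there is no cluster to manage.
    response = s3.select_object_content(
        Bucket="my-views-bucket",                        # placeholder bucket
        Key="batch-views/2019/06/01/12/part-0.parquet",  # placeholder key
        ExpressionType="SQL",
        Expression="SELECT s.user_id, s.score FROM s3object s WHERE s.user_id = '123'",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"JSON": {}},
    )

    # The result arrives as an event stream; collect the record payloads.
    for event in response["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"))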

The first option seems bad because of the batch inserts into the ElastiCache instance used by streaming. Also, does keeping batch and speed layer views in the same data stores even follow the Lambda architecture?

The second option is bad because it means a fourth database storage, isn't it?

Upvotes: 0

Views: 277

Answers (1)

Jens Roland

Reputation: 27770

In this case you might want to use something like HBase or Druid; not only can they handle batch inserts and very low-latency random reads, they could even replace the DynamoDB/ElastiCache component of your solution, since you can write directly to them from the incoming stream (to a different table).

Druid is probably superior for this, but given your requirements, you'll want HBase, as it is available on EMR with the Amazon Hadoop distribution, whereas Druid doesn't come in a managed offering.
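
To make that concrete, here is a minimal sketch of both access patterns against HBase on EMR using the happybase Python client (the hostname, table, and column names are hypothetical, and it assumes the HBase Thrift server is running on the cluster):

    import happybase

    # Connect to the HBase Thrift server on the EMR master node
    # (hostname and table name are placeholders).
    connection = happybase.Connection("emr-master-node", port=9090)
    table = connection.table("batch_views")

    # Batch insert: puts are buffered client-side and flushed in chunks,
    # which is how the hourly batch output would be loaded.
    with table.batch(batch_size=1000) as batch:
        for i in range(10_000):
            batch.put(
                f"user#{i}".encode(),
                {b"cf:score": str(i % 100).encode()},
            )

    # Low-latency random read by row key, as the serving layer would do.
    row = table.row(b"user#42")
    print(row[b"cf:score"])

The same tables can be written from Spark on the batch side and from Spark Streaming on the speed side; just keep the batch and speed views in separate tables.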

Upvotes: 1
