Reputation: 21
We have a traditional batch application where we ingest data from multiple sources (Oracle, Salesforce, FTP files, web logs, etc.). We store the incoming data in an S3 bucket and run Spark on EMR to process the data and load it into S3 and Redshift.
Now we are thinking of making this application near real time by bringing in AWS Kinesis and then using Spark Structured Streaming on EMR to process the streaming data and load it into S3 and Redshift. Given that we have a wide variety of data, e.g. 100+ tables from Oracle, 100+ Salesforce objects, 20+ files coming from an FTP location, web logs, etc., what is the best way to use AWS Kinesis here?
1) Using a separate stream for each source (Salesforce, Oracle, FTP) and then a separate shard (within a stream) for each table/object - each consumer reads from its own shard, which holds a particular table/file.
2) Using a separate stream for each table/object - we will end up having 500+ streams in this scenario.
3) Using a single stream for everything - not sure how the consumer app would read data in this scenario.
Upvotes: 1
Views: 3396
Reputation: 81454
Kinesis does not care what data you put into a stream; to Kinesis, a record is just a blob. It is up to you to determine (code) the writers and readers for a stream. You could intermix different types of data in one stream, but the consumer will then need to figure out what each blob is and what to do with it.
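As a rough sketch of what that means in practice (the envelope fields `source` and `table` and the handler registry are my own invention, not anything Kinesis prescribes), each record's data blob can carry a self-describing JSON wrapper so a single consumer can tell the blobs apart and dispatch them:

```python
import json

def encode_record(source, table, payload):
    """Wrap a row in a self-describing envelope before PutRecord.
    The envelope schema here is hypothetical."""
    return json.dumps(
        {"source": source, "table": table, "payload": payload}
    ).encode("utf-8")

def dispatch(blob, handlers):
    """Route a raw Kinesis data blob to the handler registered
    for its (source, table) pair."""
    record = json.loads(blob.decode("utf-8"))
    key = (record["source"], record["table"])
    if key not in handlers:
        raise KeyError(f"no handler for {key}")
    return handlers[key](record["payload"])

# Example: a handler for an Oracle ORDERS table
handlers = {("oracle", "ORDERS"): lambda p: ("loaded", p["order_id"])}

blob = encode_record("oracle", "ORDERS", {"order_id": 42})
print(dispatch(blob, handlers))  # ('loaded', 42)
```

The cost of a single mixed stream is exactly this dispatch layer: every consumer must understand every envelope, which is why splitting by data type keeps the readers simpler.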
I would break this into multiple streams based upon the type and priority of the data. That will make implementation and debugging a lot easier.
I think you are misunderstanding what shards are. They exist for performance (scaling throughput), not for data separation.
Upvotes: 6