Reputation: 21
We have a traditional batch application where we ingest data from multiple sources (Oracle, Salesforce, FTP files, web logs, etc.). We store the incoming data in an S3 bucket and run Spark on EMR to process the data and load it into S3 and Redshift.
Now we are thinking of making this application near real time by bringing in AWS Kinesis and then using Spark Structured Streaming on EMR to process the streaming data and load it into S3 and Redshift. Given that we have a wide variety of data, e.g. 100+ tables from Oracle, 100+ Salesforce objects, 20+ files coming from an FTP location, web logs, etc., what is the best way to use AWS Kinesis here?
1) Using a separate stream for each source (Salesforce, Oracle, FTP) and then a separate shard (within a stream) for each table/object - each consumer reads from its own shard, which holds a particular table/file.
2) Using a separate stream for each table/object - we will end up having 500+ streams in this scenario.
3) Using a single stream for everything - not sure how the consumer app would read data in this scenario.
Upvotes: 1
Views: 3396
Reputation: 81454
Kinesis does not care what data you put into a stream; to Kinesis, a record is just a blob. It is up to you to determine (code) the writers and readers for a stream. You could intermix different types of data in one stream, but the consumer will then need to figure out what each blob is and what to do with it.
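As a rough sketch of what that means in practice (the envelope fields `source` and `table` and the handler registry are my own invention, not anything Kinesis prescribes), each record's data blob can carry a self-describing JSON wrapper so a single consumer can tell the blobs apart and dispatch them:

```python
import json

def encode_record(source, table, payload):
    """Wrap a row in a self-describing envelope before PutRecord.
    The envelope schema here is hypothetical."""
    return json.dumps(
        {"source": source, "table": table, "payload": payload}
    ).encode("utf-8")

def dispatch(blob, handlers):
    """Route a raw Kinesis data blob to the handler registered
    for its (source, table) pair."""
    record = json.loads(blob.decode("utf-8"))
    key = (record["source"], record["table"])
    if key not in handlers:
        raise KeyError(f"no handler for {key}")
    return handlers[key](record["payload"])

# Example: a handler for an Oracle ORDERS table
handlers = {("oracle", "ORDERS"): lambda p: ("loaded", p["order_id"])}

blob = encode_record("oracle", "ORDERS", {"order_id": 42})
print(dispatch(blob, handlers))  # ('loaded', 42)
```

The cost of a single mixed stream is exactly this dispatch layer: every consumer must understand every envelope, which is why splitting by data type keeps the readers simpler.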
I would break this into multiple streams based upon the type and priority of the data. That will make implementation and debugging a lot easier.
I think you are misunderstanding what shards are. They exist for performance (scaling throughput), not for data separation.
Upvotes: 6