Reputation: 1363
I have a scenario where I am using Spark Streaming to collect data from the Kinesis service, following https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html
In the streaming job I do some aggregation on the data and emit it to HDFS; that part is working so far. Now I want a way to collect the last hour's data (hourly data) from HDFS, feed it to a new Spark or MapReduce job, do some further aggregation there, and send the result to a target analytics service.
Questions:
1. How do I get the hourly aggregated data from HDFS into the next Spark, MapReduce, or other data-processing job? Do we need some partitioning before we emit from Spark to make this possible?
2. Can we use Amazon Data Pipeline for this? Suppose we emit data without partitioning, say into the /user/hadoop/ folder: how can Data Pipeline understand that it needs to pick up only the last hour's data? Can we do this by applying some constraint on the folder names, e.g. a timestamp?
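For reference, here is a minimal sketch of what I have in mind, assuming one HDFS directory per hour (the stream name, endpoint, paths, and batch interval are placeholders, not my real values):

    import java.text.SimpleDateFormat
    import java.util.Date

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils

    object KinesisToHourlyHdfs {
      // Hypothetical layout: one HDFS directory per hour,
      // e.g. /user/hadoop/events/2015-06-01-13
      private val hourFmt = new SimpleDateFormat("yyyy-MM-dd-HH")

      def hourlyPath(ms: Long): String =
        "/user/hadoop/events/" + hourFmt.format(new Date(ms))

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("kinesis-to-hdfs"), Seconds(60))

        // Kinesis receiver as in the linked integration guide
        // (stream name and endpoint are placeholders).
        val kinesisStream = KinesisUtils.createStream(
          ssc, "myKinesisStream", "https://kinesis.us-east-1.amazonaws.com",
          Seconds(60), InitialPositionInStream.LATEST,
          StorageLevel.MEMORY_AND_DISK_2)

        val records = kinesisStream.map(bytes => new String(bytes, "UTF-8"))

        // ... the existing streaming aggregation on `records` goes here ...

        // Write each micro-batch under the directory for its batch hour,
        // so a downstream job can address exactly one hour of data by path.
        records.foreachRDD { (rdd, time) =>
          rdd.saveAsTextFile(
            hourlyPath(time.milliseconds) + "/batch-" + time.milliseconds)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

The downstream batch job could then read exactly one hour of data with something like sc.textFile(hourlyPath(System.currentTimeMillis() - 3600 * 1000) + "/*"). Is this hour-stamped layout the right approach, or is there a better way?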
Upvotes: 0
Views: 768
Reputation: 44
I am not sure about your use case, but Data Pipeline has a sample that works with Kinesis. It might give you a hint:
https://github.com/awslabs/data-pipeline-samples/tree/master/samples/kinesis
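On the "pick up the last hour's data" part: a Data Pipeline definition can embed its scheduled run time into a path with the #{format(@scheduledStartTime, ...)} expression, so hour-stamped folder names are exactly what it expects. A hypothetical fragment (bucket name and object ids are made up, and it assumes your streaming job writes one folder per hour, staged to S3, Data Pipeline's native store):

    {
      "objects": [
        {
          "id": "HourlySchedule",
          "type": "Schedule",
          "period": "1 hour",
          "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {
          "id": "HourlyInput",
          "type": "S3DataNode",
          "schedule": { "ref": "HourlySchedule" },
          "directoryPath": "s3://my-bucket/events/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH')}"
        }
      ]
    }

Each hourly run then resolves directoryPath to the folder for its own hour, e.g. .../2015-06-01-13.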
Upvotes: 1
Reputation: 1
If you are using the Mesos cluster manager, you can take a look at Chronos for job scheduling: http://nerds.airbnb.com/introducing-chronos/
Otherwise, for a Spark standalone cluster, you can simply schedule the job through crontab or from an external application.
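For example, a crontab entry that launches the hourly rollup a few minutes past each hour could look like this (the master URL, jar, class name, and paths are all placeholders):

    # m h dom mon dow  command  (everything below is a placeholder path/name)
    5 * * * * /opt/spark/bin/spark-submit --master spark://master:7077 --class com.example.HourlyRollup /opt/jobs/hourly-rollup.jar >> /var/log/hourly-rollup.log 2>&1

Running a few minutes after the hour gives the streaming job time to finish writing the previous hour's folder before the batch job reads it.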
Upvotes: 0