Sam

Reputation: 1363

How to run a Spark or MapReduce job on hourly aggregated data on HDFS produced by Spark Streaming at a 5-minute interval

I have a scenario where I am using Spark Streaming to collect data from the Kinesis service, following https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

In the streaming job I am doing some aggregation on the data and emitting it to HDFS, and I have been able to get that working so far. Now I want a way to collect all of the last hour's data, feed it to a new Spark or MapReduce job, do some aggregation again, and send the result to the target analytics service.
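For context, here is roughly what the streaming side does (a simplified sketch; the app name, stream name, endpoint, and key extraction are placeholders, not my real code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits on Spark 1.2
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object FiveMinuteAggregator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kinesis-5min-agg")
    val ssc  = new StreamingContext(conf, Seconds(300)) // 5-minute batches

    // Receive raw records from Kinesis (Spark 1.2 API)
    val stream = KinesisUtils.createStream(
      ssc, "my-stream", "https://kinesis.us-east-1.amazonaws.com",
      Seconds(300), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)

    // Toy aggregation: count occurrences per record key
    val counts = stream.map(bytes => (new String(bytes), 1L)).reduceByKey(_ + _)

    // One output directory per 5-minute batch, e.g. /user/hadoop/agg-1429623900000
    counts.saveAsTextFiles("hdfs:///user/hadoop/agg")

    ssc.start()
    ssc.awaitTermination()
  }
}
```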

My questions:

1. How do I get the hourly aggregated data from HDFS into the next Spark, MapReduce, or other data-processing job? Do we need some partitioning before we emit from Spark to make this possible? (See the sketch after this list.)
2. Can we use Amazon Data Pipeline for this? Suppose we emit data without partitioning, say into the /user/hadoop/ folder: how can Data Pipeline know that it needs to pick up only the last hour's data? Can we do this by applying some constraint to the folder names, such as a timestamp?
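One approach I am considering for question 1 (sketch only, continuing the code above; the paths and hour format are my own placeholders): have the streaming job route each 5-minute batch into an hour-stamped folder, so a downstream job can find the previous hour by path alone:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Streaming side: write each 5-minute batch under an hourly folder,
// e.g. hdfs:///user/hadoop/agg/2015-04-21-13/batch-1429623900000
counts.foreachRDD { (rdd, time) =>
  val hour = new SimpleDateFormat("yyyy-MM-dd-HH").format(new Date(time.milliseconds))
  rdd.saveAsTextFile(s"hdfs:///user/hadoop/agg/$hour/batch-${time.milliseconds}")
}
```

The hourly job would then only need to compute the previous hour's folder name and read everything under it:

```scala
// Hourly batch job: read everything the stream wrote for the previous hour
val lastHour = new SimpleDateFormat("yyyy-MM-dd-HH")
  .format(new Date(System.currentTimeMillis() - 3600 * 1000))
val hourly = sc.textFile(s"hdfs:///user/hadoop/agg/$lastHour/*")
// ... re-aggregate here and push to the analytics service
```

The same folder-name convention would presumably answer question 2 as well: a scheduler such as Data Pipeline could be pointed at the hour-stamped path rather than at a flat /user/hadoop/ folder.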

Upvotes: 0

Views: 768

Answers (2)

Junren

Reputation: 44

I am not sure about your use case, but Data Pipeline has a sample that works with Kinesis. It might give you a hint:

https://github.com/awslabs/data-pipeline-samples/tree/master/samples/kinesis

Upvotes: 1

Suyog

Reputation: 1

If you are using the Mesos cluster manager, you can take a look at Chronos for job scheduling: http://nerds.airbnb.com/introducing-chronos/

Otherwise, for a Spark standalone cluster, you can simply schedule it through crontab or from an external application.
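For example, a crontab entry that kicks off the hourly job at the top of every hour (the paths, master URL, and class name are placeholders):

```
0 * * * * /opt/spark/bin/spark-submit --master spark://master:7077 \
  --class com.example.HourlyAggregator /opt/jobs/hourly-agg.jar \
  >> /var/log/hourly-agg.log 2>&1
```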

Upvotes: 0
