Reputation: 8678
I have to design a setup to read incoming streaming data from Twitter. I have decided to use Apache Kafka with Spark Streaming for real-time processing, and the analytics must be shown in a dashboard. I am a newbie in this domain. My assumed data rate will be 10 Mb/sec maximum. I have decided to use one machine for Kafka, with 12 cores and 16 GB of memory; ZooKeeper will also be on the same machine. I am confused about Spark: it will only have to perform the streaming analysis job. The analyzed output is then pushed to a DB and the dashboard. Confused list:
Upvotes: 1
Views: 364
Reputation: 143
My attempt at an answer:
- Should I run Spark on a Hadoop cluster or on the local file system?
I recommend HDFS: it can store more data and provides high availability.
- Can Spark's standalone mode fulfill my requirements?
Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
YARN doesn’t need to run a separate ZooKeeper Failover Controller.
YARN will likely be preinstalled in many Hadoop distributions, such as CDH.
So I recommend YARN.
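To make the difference between the modes concrete, here is a minimal sketch of how the `--master` argument to `spark-submit` changes between local, standalone, and YARN. The application name `stream_job.py` and the host `master-host` are made-up placeholders, and 7077 is only the default port of a standalone master, so adjust them to your setup.

```python
def spark_submit_command(mode, app="stream_job.py", master_host="master-host"):
    """Build a spark-submit argument list for one deployment mode.

    `app` and `master_host` are hypothetical placeholders for illustration.
    """
    masters = {
        "local": "local[*]",                          # one JVM, all local cores
        "standalone": f"spark://{master_host}:7077",  # Spark's own cluster manager
        "yarn": "yarn",                               # cluster located via the Hadoop config
    }
    if mode not in masters:
        raise ValueError(f"unknown mode: {mode}")

    cmd = ["spark-submit", "--master", masters[mode]]
    if mode == "yarn":
        # Run the driver inside the cluster, which suits a long-running streaming job.
        cmd += ["--deploy-mode", "cluster"]
    cmd.append(app)
    return cmd


for mode in ("local", "standalone", "yarn"):
    print(" ".join(spark_submit_command(mode)))
```

The job code itself stays the same in all three cases; only the submit command differs, so you can start with standalone and move to YARN later without rewriting the application.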
- Is my approach appropriate, or what would be best in this case?
If your data is no more than 10 million, I think you can use a local cluster. Local mode avoids shuffling across many nodes, and shuffles between processes are faster than shuffles between nodes.
Otherwise I recommend at least 3 nodes, which is a real Hadoop cluster.
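To judge which side of that line you are on, it may help to turn the 10 Mb/sec from the question into a daily volume. This is only a back-of-the-envelope sketch: the question does not say whether "Mb" means megabits or megabytes, so both are computed, and it assumes the maximum rate is sustained all day and uses decimal units (1 GB = 1000 MB).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(rate, unit):
    """Daily data volume in GB for a sustained rate given per second.

    `unit` is "megabytes" or "megabits"; decimal units are assumed.
    """
    if unit == "megabytes":
        megabytes_per_sec = rate
    elif unit == "megabits":
        megabytes_per_sec = rate / 8  # 8 bits per byte
    else:
        raise ValueError(f"unknown unit: {unit}")
    return megabytes_per_sec * SECONDS_PER_DAY / 1000


print(daily_volume_gb(10, "megabytes"))  # 864.0 GB per day
print(daily_volume_gb(10, "megabits"))   # 108.0 GB per day
```

Either way that is at most a few hundred GB per day at peak, so how long you need to keep the raw data matters more for the storage decision than the rate itself.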
I am a Spark beginner and this is my understanding. I hope an expert will correct me.
Upvotes: 1