user1050619

Reputation: 20906

Run spark job in multiple nodes

I'm trying to run a sample Spark job and it works fine. Now I need to run the same job on multiple nodes in a cluster. What needs to change in my program to make it run on multiple nodes?

from pyspark import SparkConf, SparkContext
import collections

# previously ran locally on a single node:
#conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
conf = SparkConf().setMaster("hadoop-master").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

# previously read from the local filesystem:
#lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
lines = sc.textFile("hdfs://hadoop-master:8020/user/hduser/gutenberg/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))

Upvotes: 1

Views: 2482

Answers (1)

Mariusz

Reputation: 13946

The only change needed in the code is the master of the Spark context. To run the script on Hadoop you need to set HADOOP_CONF_DIR in the environment and set the master to yarn. All of this is explained in the documentation: http://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn
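As a minimal sketch of that change, assuming a Spark version that accepts "yarn" as the master string and that HADOOP_CONF_DIR is exported in the shell before launching (e.g. export HADOOP_CONF_DIR=/etc/hadoop/conf, where the path depends on your installation):

from pyspark import SparkConf, SparkContext
import collections

# "yarn" tells Spark to request executors from the YARN cluster located
# via the configuration files under HADOOP_CONF_DIR
conf = SparkConf().setMaster("yarn").setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

# the HDFS input path from the question stays the same
lines = sc.textFile("hdfs://hadoop-master:8020/user/hduser/gutenberg/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))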

Upvotes: 3
