RamsesXVII

Reputation: 305

PySpark very slow on Amazon cluster

I'm working on a Spark project and trying to execute the app on an Amazon cluster. Performance is very slow, even on small files. I don't want a solution, just an opinion about the possible reasons for the slowness.

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("appName").getOrCreate()
sc = spark.sparkContext

rec = sc.textFile(sys.argv[1])
#  rec = sc.parallelize(records.collect())

# split the tab-separated lines, keep rows whose 7th field is >= 4,
# and key them by the 2nd field
a = (rec.map(lambda line: line.split("\t"))
        .filter(lambda x: int(x[6]) >= 4)
        .map(lambda x: (x[1], [x[2], x[6]])))

# self-join on that key, keep one ordering of each value pair, regroup by
# the pair, drop groups with 2 or fewer distinct keys, sort and write out
(a.join(a)
  .filter(lambda kv: kv[1][0][0] < kv[1][1][0])
  .map(lambda kv: ((kv[1][1][0], kv[1][0][0]), kv[0]))
  .groupByKey()
  .filter(lambda kv: len(set(kv[1])) > 2)
  .sortBy(lambda kv: kv[0])
  .saveAsTextFile(sys.argv[2]))

Upvotes: 0

Views: 1525

Answers (2)

Garren S

Reputation: 5792

This code sequence alone may be causing you the most grief:

records = sc.textFile(sys.argv[1])
rec = sc.parallelize(records.collect())

Why? You read the file in as an RDD via the SparkContext's textFile function, and it gets materialized across the cluster. Then, and here's the kicker, by calling records.collect() you tell the cluster to send ALL the data BACK to the driver (whatever machine launched the job), and finally you rebuild the RDD from that locally collected list. Just use the records RDD in place of the rec RDD.
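For concreteness, a minimal sketch of that fix, reusing the variable names and column indices from the question:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("appName").getOrCreate()
sc = spark.sparkContext

# textFile already returns a distributed RDD; no collect()/parallelize() round trip
records = sc.textFile(sys.argv[1])

a = (records.map(lambda line: line.split("\t"))
            .filter(lambda x: int(x[6]) >= 4)
            .map(lambda x: (x[1], [x[2], x[6]])))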

Edit: It looks like you're doing a cross join of a to itself. Is that intended and for what purpose?

groupByKey forces a shuffle of all the data, in contrast to reduceByKey, which combines values on each partition before anything goes over the network.
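As a toy illustration of the difference (using the sc from the question, not its data):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey ships every individual value across the network before combining anything
grouped_counts = pairs.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey combines values within each partition first, so far less data is shuffled
reduced_counts = pairs.reduceByKey(lambda x, y: x + y)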

Stop using the SparkContext; it's there for backwards compatibility. Use your SparkSession's .read.option("delimiter", "\t").csv(file path), which creates a DataFrame instead of an RDD and parses the tab-delimited lines for you into generic Row objects. Critically, by using the DataFrame APIs you get far better performance (courtesy of Tungsten and Catalyst). This matters especially since you're using PySpark: with Spark 2.x and DataFrames, Python performance is in line with Scala on the JVM, whereas using RDDs with Python means the Python interpreters do the work (so Scala is much faster).
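A rough sketch of that approach, reusing the spark session and sys.argv input path from the question, with Spark's default _c0, _c1, ... column names standing in for real ones:

from pyspark.sql import functions as F

# read the tab-delimited file straight into a DataFrame of Row objects
df = spark.read.option("delimiter", "\t").csv(sys.argv[1])

# equivalent of the RDD-side filter: keep rows whose 7th column is >= 4
filtered = (df.filter(F.col("_c6").cast("int") >= 4)
              .select("_c1", "_c2", "_c6"))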

Upvotes: 1

Dat Tran

Reputation: 2392

Well, PySpark, or Spark in general, is made for use cases where you need to deal with a lot of data. It is expected to be slow for small data. Here are the reasons:

  1. Depending on how many executors you have, PySpark needs to spin up a JVM for each of them first. Moreover, in the case of PySpark, additional overhead comes from the Python sub-processes, which need to be invoked as well (see PySpark Internals).
  2. The second reason is data shuffling. Your data might be shuffled around the network. In the local case the data is computed on the same node; with distributed data, the scheduler first needs to figure out where to put the data and then what to do with it.

So PySpark/Spark only shines when you need to do something with "big data"! I've seen many people get very disappointed at first, saying that Spark is slow, when they only used it on a very small amount of data. Hope this helps!

Upvotes: 0
