Mike

Reputation: 197

Map transformation performance: Spark DataFrame vs RDD

I have a four-node Hadoop cluster (MapR) with 40 GB of memory on each node. I need to apply a function to one of the fields of a large dataset (500 million rows). The flow of my code is that I read the data from a Hive table as a Spark DataFrame and apply the desired function to one of the columns as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

schema = StructType([StructField("field1", IntegerType(), False), StructField("field2", StringType(), False), StructField("field3", FloatType(), False)])
udfCos = udf(lambda row: function_call(row), schema)
result = SparkDataFrame.withColumn("temp", udfCos(stringArgument))

The equivalent RDD version might look like this:

result = sparkRDD.map(lambda row: function_call(row))

I would like to improve the performance of this piece of code by making sure it runs with maximum parallelism and the lowest possible run time. I need help in applying Spark concepts such as repartition, the parallelism value in the SparkConf, or other approaches, in the context of my problem. Any help is appreciated.
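As a rough sketch of what I am considering (the partition count of 150 below simply mirrors spark.default.parallelism and is only a placeholder, not a value I have settled on):

# Sketch only: repartition before applying the UDF so the work is spread
# across more tasks; 150 mirrors spark.default.parallelism above.
repartitioned = SparkDataFrame.repartition(150)
result = repartitioned.withColumn("temp", udfCos(stringArgument))

# RDD equivalent: repartition, then map
resultRDD = sparkRDD.repartition(150).map(lambda row: function_call(row))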

My Spark startup parameters:

MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 10 --driver-cores 10 --driver-memory 30g --executor-memory 7g --executor-cores 5 --conf spark.driver.maxResultSize="0" --conf spark.default.parallelism="150"

Upvotes: 1

Views: 877

Answers (1)

Bhavesh

Reputation: 919

For tuning your application you need to know a few things:

1) You need to monitor your application to see whether your cluster is under-utilized, and how many resources the application you have created is actually using.

Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.

2) Based on those observations about CPU and memory usage you can get a better idea of what kind of tuning your application needs.

From Spark's point of view:

In spark-defaults.conf

you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage-collection algorithm.

Below are a few examples; you can tune these parameters based on your requirements:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions    -XX:MaxPermSize=6G -XX:+UseG1GC
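If you would rather set these from your application instead of spark-defaults.conf, the same settings can also be passed through SparkConf; a minimal sketch (the values here are only placeholders, tune them to your own cluster):

from pyspark import SparkConf, SparkContext

# Sketch only: the same tuning options can be set programmatically.
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.default.parallelism", "150")   # placeholder value
        .set("spark.executor.memory", "7g"))       # placeholder value
sc = SparkContext(conf=conf)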

For more details refer to http://spark.apache.org/docs/latest/tuning.html

Upvotes: 0
