Mike

Reputation: 197

Map transformation performance: Spark DataFrame vs RDD

I have a four-node Hadoop cluster (MapR) with 40 GB of memory on each node. I need to apply a function to one of the fields of a large dataset (500 million rows). The flow of my code is that I read the data from a Hive table as a Spark DataFrame and apply the desired function to one of the columns as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

schema = StructType([StructField("field1", IntegerType(), False), StructField("field2", StringType(), False), StructField("field3", FloatType(), False)])
udfCos = udf(lambda row: function_call(row), schema)
result = SparkDataFrame.withColumn("temp", udfCos(stringArgument))

The equivalent RDD version might look like this:

result = sparkRDD.map(lambda row: function_call(row))

I would like to improve the performance of this piece of code by making sure it runs with maximum parallelism and the lowest possible run time. I need help in applying Spark concepts such as repartition, the parallelism value in the SparkConf, or other approaches, in the context of my problem. Any help is appreciated.
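As a rough sketch of what I am considering (the partition count of 150 below simply mirrors spark.default.parallelism and is only a placeholder, not a value I have settled on):

# Sketch only: repartition before applying the UDF so the work is spread
# across more tasks; 150 mirrors spark.default.parallelism above.
repartitioned = SparkDataFrame.repartition(150)
result = repartitioned.withColumn("temp", udfCos(stringArgument))

# RDD equivalent: repartition, then map
resultRDD = sparkRDD.repartition(150).map(lambda row: function_call(row))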

My Spark startup parameters:

MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 10 --driver-cores 10 --driver-memory 30g --executor-memory 7g --executor-cores 5 --conf spark.driver.maxResultSize="0" --conf spark.default.parallelism="150"

Upvotes: 1

Views: 877

Answers (1)

Bhavesh

Reputation: 919

For tuning your application you need to know a few things:

1) You need to monitor your application to see whether your cluster is under-utilized, and how many resources the application you have created is actually using.

Monitoring can be done using various tools, e.g. Ganglia. From Ganglia you can find CPU, memory and network usage.

2) Based on those observations about CPU and memory usage you can get a better idea of what kind of tuning your application needs.

From Spark's point of view:

In spark-defaults.conf

you can specify what kind of serialization is needed, how much driver memory and executor memory your application needs, and you can even change the garbage-collection algorithm.

Below are a few examples; you can tune these parameters based on your requirements:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions  -XX:MaxPermSize=2G -XX:+UseG1GC
spark.driver.extraJavaOptions    -XX:MaxPermSize=6G -XX:+UseG1GC
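If you would rather set these from your application instead of spark-defaults.conf, the same settings can also be passed through SparkConf; a minimal sketch (the values here are only placeholders, tune them to your own cluster):

from pyspark import SparkConf, SparkContext

# Sketch only: the same tuning options can be set programmatically.
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.default.parallelism", "150")   # placeholder value
        .set("spark.executor.memory", "7g"))       # placeholder value
sc = SparkContext(conf=conf)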

For more details refer to http://spark.apache.org/docs/latest/tuning.html

Upvotes: 0
