Reputation: 191
Please look at the attached screenshot.
I am trying to do some performance improvement to my spark job and its taking almost 5 min to execute the take action on dataframe.
I am using take for making sure that dataframe has some records in it and if it is present, I want to proceed for further processing.
I tried take and count and don't see much difference in the time for execution.
Another scenario is where its taking around 10min to write the datafrane into hive table(it has max 200 rows and 10 columns).
df.write.mode("append").partitionBy("date").insertInto(tablename)
Please suggest how we can minimize the time its taking for take and insert into hive table.
Updates:
Here is my spark submit : spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar
its a simple dataframe which has 8 columns with around 200 records in it and I am using following code to insert into hive table.
df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)
Thanks,Bab
Upvotes: 0
Views: 992
Reputation: 232
Don't use count before the write if not necessary and if your table is already created then use Spark SQL to insert the data into Hive Partitioned table.
spark.sql("Insert into <tgt tbl> partition(<col name>) select cols,partition col from temp_tbl")
Upvotes: 0