ben10

Reputation: 83

Using Apache Spark for fast computation process but writing into Database consumes huge amount of time

I am using Apache Spark to load datasets from an Oracle database using the JDBC format in Spark-Java. I am loading datasets of 10M and 1M rows and performing various operations on them. After all the processing is done, I write multiple datasets back to the database, which consumes almost 90% of the total time. How can I reduce this write time to the Oracle database? I tried various combinations of executor memory, cores, and number of executors, but no major difference was obtained. The command used to run the built Spark jar file is:

    ./spark-submit --class com.sample.Transformation \
        --conf spark.sql.shuffle.partitions=5001 \
        --num-executors=22 --executor-cores=8 --executor-memory=10GB \
        --jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-api-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/drools-core-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/drools-compiler-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-maven-support-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-internal-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/xstream-1.4.10.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-commons-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/ecj-4.4.2.jar,/scratch/rmbbuild/spark_ormb/drools-jars/mvel2-2.4.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-commons-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-api-7.7.0.Final.jar \
        --driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar \
        --master spark://10.180.182.218:7077 \
        "/scratch/rmbbuild/spark_ormb/POC-jar/Transformation-0.0.1-SNAPSHOT.jar" > /scratch/rmbbuild/spark_ormb/POC-jar/logs/logs10.txt

The write methods in the Transformation.java file are as follows:

    txndetailDS.write().mode("append").format("jdbc")
        .option("url", connection).option("dbtable", CI_TXN_DETAIL).save();

    finalpriceItemParamsGrpDS.distinct().write().mode("append").format("jdbc")
        .option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP_K).save();

    finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc")
        .option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP).save();

    finalPritmOutput.distinct().write().mode("append").format("jdbc")
        .option("url", connection).option("dbtable", CI_TXN_DTL_PRITM).save();

    outputDataSetforsummary.write().mode("append").format("jdbc")
        .option("url", connection).option("dbtable", CI_TXN_DTL_PRITM_SUMMARY).save();

So, if anyone has faced this issue, can you help me out?

Or, how can I tune spark-submit for optimal performance in this scenario, so that the write to the database is sped up?

Upvotes: 0

Views: 1867

Answers (2)

Pawan B

Reputation: 4623

Set the partitions with spark-submit:

    --conf spark.sql.shuffle.partitions=5000
    --conf spark.default.parallelism=5000

The number of partitions should be a multiple of the number of cores (8 × 625 = 5000).

If you don't have a large amount of data, then don't use a large number of partitions.

You can read more about optimizing jobs here.

You can set these properties also:

batchsize: The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.

numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections.
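
For example, a minimal sketch applying both options to one of the writes from the question (the dataset, connection, and table names are reused from the question; the batchsize and numPartitions values here are illustrative assumptions, not tested recommendations):

    // Sketch only: names come from the question; option values are assumptions.
    txndetailDS.write().mode("append").format("jdbc")
        .option("url", connection)
        .option("dbtable", CI_TXN_DETAIL)
        .option("batchsize", "10000")      // rows inserted per JDBC round trip (default 1000)
        .option("numPartitions", "8")      // caps concurrent JDBC connections for this write
        .save();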

Upvotes: 0

Mansoor Baba Shaik

Reputation: 492

Try reconfiguring your executors so that their total memory does not exceed the available memory, and use only one core per executor; this way you can achieve better parallelism and performance.

--num-executors=40 --executor-cores=1 --executor-memory=5G

Keep in mind that the JDBC connection is single-threaded per executor, which is why it makes sense to use just one core. With the above configuration, 40 parallel connections would be established to the RDBMS.

Also, I suggest using .option("batchsize", "10000") so that insertion is faster: 10000 records per round trip rather than the default 1000.
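
For instance, one of the writes from the question with this option added (a sketch reusing the question's names; only the batchsize option is new):

    // Sketch: same write as in the question, with batchsize raised to 10000.
    txndetailDS.write().mode("append").format("jdbc")
        .option("url", connection)
        .option("dbtable", CI_TXN_DETAIL)
        .option("batchsize", "10000")   // 10000 records per round trip instead of 1000
        .save();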

Upvotes: 1
