Reputation: 83
I am using Apache Spark to load datasets from an Oracle database using the JDBC format in Spark-Java. I am loading datasets of size 10M and 1M and performing various operations on them. After all the processing is done, I write multiple datasets back to the DB, which consumes almost 90% of the total time. So, how can I reduce the time taken to write to the Oracle database? I tried various combinations of executor memory, cores, and number of executors, but did not obtain a major difference. The command used to run the built Spark jar file is:
./spark-submit --class com.sample.Transformation --conf spark.sql.shuffle.partitions=5001 --num-executors=22 --executor-cores=8 --executor-memory=10GB --jars /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-api-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/drools-core-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/drools-compiler-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-maven-support-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-internal-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/xstream-1.4.10.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-commons-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/ecj-4.4.2.jar,/scratch/rmbbuild/spark_ormb/drools-jars/mvel2-2.4.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-commons-7.7.0.Final.jar,/scratch/rmbbuild/spark_ormb/drools-jars/kie-soup-project-datamodel-api-7.7.0.Final.jar --driver-class-path /scratch/rmbbuild/spark_ormb/drools-jars/ojdbc6.jar --master spark://10.180.182.218:7077 "/scratch/rmbbuild/spark_ormb/POC-jar/Transformation-0.0.1-SNAPSHOT.jar" > /scratch/rmbbuild/spark_ormb/POC-jar/logs/logs10.txt
txndetailDS.write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_TXN_DETAIL).save();
finalpriceItemParamsGrpDS.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP_K).save();
finalpriceItemParamsGroupTable.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_PRICEITEM_PARM_GRP).save();
finalPritmOutput.distinct().write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_TXN_DTL_PRITM).save();
outputDataSetforsummary.write().mode("append").format("jdbc").option("url", connection).option("dbtable", CI_TXN_DTL_PRITM_SUMMARY).save();
Upvotes: 0
Views: 1867
Reputation: 4623
Set the partitions with spark-submit:
--conf spark.sql.shuffle.partitions=5000
--conf spark.default.parallelism=5000
The number of partitions should be a multiple of the number of cores
(8 * 625 = 5000)
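For example, based on the command in the question, both settings could be passed to spark-submit roughly like this (a sketch; the remaining --jars, --driver-class-path and --master arguments stay exactly as in the question):
./spark-submit --class com.sample.Transformation \
    --conf spark.sql.shuffle.partitions=5000 \
    --conf spark.default.parallelism=5000 \
    ...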
If you don't have a lot of data, don't use a large number of partitions.
You can read more about optimizing jobs here.
You can also set these properties:
batchsize: The JDBC batch size, which determines how many rows to insert per round trip. This can help performance on JDBC drivers. This option applies only to writing. It defaults to 1000.
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections.
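As a sketch, both options can be supplied on the JDBC writer; the dataset and table names below are taken from the question, and the concrete values are only illustrative:
finalPritmOutput.distinct().write()
        .mode("append")
        .format("jdbc")
        .option("url", connection)
        .option("dbtable", CI_TXN_DTL_PRITM)
        .option("batchsize", "5000")        // rows inserted per JDBC round trip (default 1000)
        .option("numPartitions", "8")       // upper bound on concurrent JDBC connections for this write
        .save();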
Upvotes: 0
Reputation: 492
Try reconfiguring the number of executors so that the total does not exceed the available memory, and use only one core per executor; this way you can achieve better performance and parallelism.
--num-executors=40 --executor-cores=1 --executor-memory=5G
Keep in mind that a JDBC connection is single-threaded per executor, which is why it makes sense to use just one core per executor. With the above configuration, 40 parallel connections would be established to the RDBMS.
Also, I suggest using .option("batchsize", "10000") so that insertion is faster: 10000 records per batch rather than the default 1000.
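Put together, one of the question's writes would then look roughly like this (a sketch; the explicit repartition to 40 is an assumption added here to line the data up with the 40 single-core executors suggested above, and is not part of the original write):
txndetailDS.repartition(40)                 // one partition per executor -> up to 40 parallel JDBC connections
        .write()
        .mode("append")
        .format("jdbc")
        .option("url", connection)
        .option("dbtable", CI_TXN_DETAIL)
        .option("batchsize", "10000")       // 10000 rows per round trip instead of the default 1000
        .save();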
Upvotes: 1