Barry

Reputation: 15

Writing data into MongoDB with Spark

When I tried to write a Spark DataFrame into MongoDB, I found that Spark only creates one task to do it. This causes bad performance because only one executor is actually running, even if I allocate many executors for the job.

My partial PySpark code:

df.write.format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()

Could Spark run multiple tasks in this case? Thanks

Spark submit:

spark-submit --master yarn --num-executors 3 --executor-memory 5g --jars $JARS_PATH/mongo-java-driver-3.5.0.jar,$JARS_PATH/mongodb-driver-core-3.5.0.jar,$JARS_PATH/mongo-spark-connector_2.11-2.2.1.jar spark-mongo.py

I found a log entry that contains this INFO:

INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, linxtd-itbigd04, executor 1, partition 0, PROCESS_LOCAL, 4660 bytes)
INFO BlockManagerMasterEndpoint: Registering block manager linxtd-itbigd04:36793 with 1458.6 MB RAM, BlockManagerId(1, linxtd-itbigd04, 36793, None)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on linxtd-itbigd04:36793 (size: 19.7 KB, free: 1458.6 MB)
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 17364 ms on linxtd-itbigd04 (executor 1) (1/1)

Upvotes: 1

Views: 1234

Answers (1)

eliasah

Reputation: 40360

Like I suspected, and as I mentioned in the comments, your data wasn't partitioned, so Spark created one task to deal with it.
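
A quick way to confirm this, and to force parallelism on the write side, is to check the DataFrame's partition count and repartition it before writing. A minimal sketch (the partition count of 8 is an arbitrary choice to tune for your data volume and cluster):

# How many partitions does the DataFrame have? (likely 1 in your case)
print(df.rdd.getNumPartitions())

# Repartition before the write so multiple tasks (and executors) write in parallel
df.repartition(8).write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .mode("append") \
    .option("spark.mongodb.output.uri", connectionString) \
    .save()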

You have to be careful when using the JDBC source: if you don't provide partitioning information, reading and writing the data won't be parallelized and you'll end up with one task.
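
For illustration, a partitioned JDBC read looks roughly like the sketch below; jdbcUrl, the table name, the partition column and the bounds are placeholders to replace with your own values:

# Spark splits the partitionColumn range into numPartitions ranges
# and reads them in parallel tasks; the resulting DataFrame keeps
# that partitioning when you write it out.
df = spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", "my_table") \
    .option("partitionColumn", "id") \
    .option("lowerBound", 1) \
    .option("upperBound", 1000000) \
    .option("numPartitions", 8) \
    .load()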

You can read more about this topic in one of my spark gotchas - Reading data using jdbc source.

Disclaimer: I’m one of the co-authors of that repo.

Upvotes: 1
