Reputation: 2179
My Spark application launches several executors, and my partitions get spread across them.
When using map() on these partitions, I want to use a MongoDB connection (via the MongoDB Java Driver) to query additional data, process it, and return it as the output of the map() function.
I want to create one connection per executor. Each partition should then access this executor-local variable and use it to query the data.
Establishing a connection for each partition is probably not a good idea. Broadcasting the connection won't work either because it is not serializable (I think?).
To sum it up: how can I create a single MongoDB connection per executor so that every partition processed on that executor can reuse it?
Upvotes: 1
Views: 426
Reputation: 18111
You should use the MongoConnector.
It handles creating the collection for you and is backed by a cache that efficiently manages the shutdown of the underlying MongoClients. It is serialisable, so it can be broadcast, and it can take options, a ReadConfig, or the Spark context to configure where to connect to.
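For instance, here is a minimal sketch of building a connector and broadcasting it, assuming the Scala API of the MongoDB Spark Connector (2.x) and that spark.mongodb.input.uri is already set in the SparkConf (`sc` below is the SparkContext):

```scala
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.ReadConfig

// Derive the connection settings from the Spark context
// (assumes spark.mongodb.input.uri is set in the SparkConf).
val readConfig = ReadConfig(sc)

// Build the connector from those options; the connector is serialisable,
// so broadcasting it to the executors works.
val connector = MongoConnector(readConfig.asOptions)
val broadcastConnector = sc.broadcast(connector)
```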
MongoConnector uses the loan pattern to handle reference management of the underlying connection to MongoDB and allows access at the MongoClient, MongoDatabase, or MongoCollection level.
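Continuing the broadcast example above, a sketch of what that looks like inside map() (the RDD `keys` and the `_id` lookup are hypothetical; withCollectionDo lends out a collection backed by the executor's cached MongoClient):

```scala
import com.mongodb.client.MongoCollection
import org.bson.Document

// `keys` is a hypothetical RDD of lookup ids. Each executor deserialises the
// broadcast connector once; the connector's internal cache then hands every
// task on that executor the same underlying MongoClient.
val enriched = keys.map { key =>
  broadcastConnector.value.withCollectionDo(readConfig, { collection: MongoCollection[Document] =>
    // Loan pattern: the collection is only valid inside this function;
    // the connector manages the client's lifecycle for you.
    collection.find(new Document("_id", key)).first()
  })
}
```

Because the cache is keyed on the connection settings, every partition on an executor transparently shares one MongoClient, which is exactly the one-connection-per-executor behaviour you are after.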
Upvotes: 1