Reputation: 2179
My Spark application launches several executors, and my partitions get spread across them.
When using map() on these partitions, I want to use a MongoDB connection (via the MongoDB Java Driver) to query additional data, process it, and return it as the output of the map() function.
I want to create one connection per executor. Each partition should then access this executor-local variable and use it to query the data.
Establishing a connection for each partition is probably not a good idea. Broadcasting the connection won't work either because it is not serializable (I think?).
To sum it up: how can I create a single MongoDB connection per executor so that every partition processed on that executor can reuse it?
Upvotes: 1
Views: 426
Reputation: 18111
You should use the MongoConnector.
It handles creating the collection for you and is backed by a cache that efficiently manages the shutdown of the underlying MongoClients. It is serialisable, so it can be broadcast, and it can take options, a ReadConfig, or the Spark context to configure where to connect to.
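For instance, here is a minimal sketch of building a connector and broadcasting it, assuming the Scala API of the MongoDB Spark Connector (2.x) and that spark.mongodb.input.uri is already set in the SparkConf (`sc` below is the SparkContext):

```scala
import com.mongodb.spark.MongoConnector
import com.mongodb.spark.config.ReadConfig

// Derive the connection settings from the Spark context
// (assumes spark.mongodb.input.uri is set in the SparkConf).
val readConfig = ReadConfig(sc)

// Build the connector from those options; the connector is serialisable,
// so broadcasting it to the executors works.
val connector = MongoConnector(readConfig.asOptions)
val broadcastConnector = sc.broadcast(connector)
```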
MongoConnector uses the loan pattern to handle reference management of the underlying connection to MongoDB and allows access at the MongoClient, MongoDatabase, or MongoCollection level.
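Continuing the broadcast example above, a sketch of what that looks like inside map() (the RDD `keys` and the `_id` lookup are hypothetical; withCollectionDo lends out a collection backed by the executor's cached MongoClient):

```scala
import com.mongodb.client.MongoCollection
import org.bson.Document

// `keys` is a hypothetical RDD of lookup ids. Each executor deserialises the
// broadcast connector once; the connector's internal cache then hands every
// task on that executor the same underlying MongoClient.
val enriched = keys.map { key =>
  broadcastConnector.value.withCollectionDo(readConfig, { collection: MongoCollection[Document] =>
    // Loan pattern: the collection is only valid inside this function;
    // the connector manages the client's lifecycle for you.
    collection.find(new Document("_id", key)).first()
  })
}
```

Because the cache is keyed on the connection settings, every partition on an executor transparently shares one MongoClient, which is exactly the one-connection-per-executor behaviour you are after.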
Upvotes: 1