Anand Kannan

Reputation: 141

How many Spark Session to create?

We are building a data ingestion framework in PySpark. The first step is to get or create a SparkSession with our app name. The structure of dataLoader.py is outlined below.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()

# create data frame from file
# process file

If I have to execute dataLoader.py concurrently to load different files, would sharing the same SparkSession cause an issue? Do I have to create a separate SparkSession for every ingestion?

Upvotes: 5

Views: 4970

Answers (4)

Kuldip Puri Tejaswi

Reputation: 452

You could create a new Spark application for every file, which is certainly possible since each Spark application has exactly one corresponding SparkSession, but it is usually not the recommended way. Loading multiple files with the same SparkSession object is usually preferred.
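
For example, a minimal sketch of reusing one session for several files (the paths and formats here are hypothetical placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('POC').getOrCreate()

# Both reads go through the same session; the paths are hypothetical.
df1 = spark.read.json('/data/file1.json')
df2 = spark.read.csv('/data/file2.csv', header=True)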

Upvotes: 0

Denis Makarenko

Reputation: 2938

Yet another option is to create a Spark session once, share it among several threads, and enable FAIR job scheduling. Each thread would execute a separate Spark job, i.e. call collect or another action on a data frame. The optimal number of threads depends on the complexity of your job and the size of the cluster. If there are too few threads, the cluster can be underutilized and waste its resources. If there are too many, the cluster will be saturated and some jobs will sit idle, waiting for executors to free up.
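
A minimal sketch of that pattern, assuming spark.scheduler.mode is set to FAIR; the file paths, pool name, and the count action standing in for real processing are placeholders:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

# FAIR scheduling lets concurrent jobs share executors instead of queuing FIFO.
spark = SparkSession.builder \
    .appName('POC') \
    .config('spark.scheduler.mode', 'FAIR') \
    .enableHiveSupport() \
    .getOrCreate()

def load_file(path):
    # Local properties are thread-local, so each thread can pick its own pool.
    spark.sparkContext.setLocalProperty('spark.scheduler.pool', 'ingestion')
    df = spark.read.csv(path, header=True)
    return df.count()  # the action submits this thread's job to the cluster

files = ['/data/file1.csv', '/data/file2.csv']  # hypothetical paths
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_file, files))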

Upvotes: 2

sdman

Reputation: 156

Each Spark job is independent, and there can be only one instance of SparkSession (and SparkContext) per JVM. You won't be able to create multiple session instances.
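
A small sketch of what this means in practice: a second getOrCreate() call hands back the session that already exists rather than building a new one (the app names here are placeholders):

from pyspark.sql import SparkSession

s1 = SparkSession.builder.appName('POC').getOrCreate()
s2 = SparkSession.builder.appName('other').getOrCreate()

# Both variables point to the same active session and SparkContext.
print(s1 is s2)  # True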

Upvotes: 0

Avishek Bhattacharya

Reputation: 6974

No, you don't create multiple Spark sessions. A SparkSession should be created only once per Spark application. Spark doesn't support this, and your job might fail if you use multiple Spark sessions in the same Spark job. See SPARK-2243, the ticket Spark closed as Won't Fix.

If you want to load different files using dataLoader.py, there are 2 options:

  1. Load and process the files sequentially. Here you load one file at a time, save it to a dataframe, and process that dataframe (see the sketch after this list).

  2. Create a different dataLoader.py script for each file and run each Spark job in parallel. Here each Spark job gets its own SparkSession.
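
A minimal sketch of option 1, with hypothetical paths and a placeholder Hive table standing in for the real processing step:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('POC').enableHiveSupport().getOrCreate()

for path in ['/data/file1.csv', '/data/file2.csv']:  # hypothetical paths
    df = spark.read.csv(path, header=True)  # load one file at a time
    df.write.mode('append').saveAsTable('poc_table')  # process the dataframe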

Upvotes: 1
