Reputation: 141
We are building a data ingestion framework in PySpark. The first step is to get or create a SparkSession with our app name. The structure of dataLoader.py is outlined below.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()

# create a DataFrame from the file
# process the file
If I have to execute this dataLoader.py concurrently to load different files, would sharing the same SparkSession cause an issue? Do I have to create a separate SparkSession for every ingestion?
Upvotes: 5
Views: 4970
Reputation: 452
You could create a new Spark application for every file, which is certainly possible since each Spark application has exactly one corresponding SparkSession, but it is usually not the recommended way. Loading multiple files with the same SparkSession object is usually preferred, as in the sketch below.
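For illustration, a minimal sketch of the shared-session approach; the file paths, CSV format, and reader options are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()

# Either loop over the files with the one session...
for path in ['/data/file1.csv', '/data/file2.csv']:  # hypothetical paths
    df = spark.read.csv(path, header=True, inferSchema=True)
    # ...process df here...

# ...or read several paths into a single DataFrame in one call,
# since spark.read.csv also accepts a list of paths.
combined = spark.read.csv(['/data/file1.csv', '/data/file2.csv'],
                          header=True, inferSchema=True)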
Upvotes: 0
Reputation: 2938
Yet another option is to create a Spark session once, share it among several threads, and enable FAIR job scheduling. Each thread would execute a separate Spark job, i.e. call collect or another action on a DataFrame. The optimal number of threads depends on the complexity of your job and the size of the cluster. If there are too few jobs, the cluster can be underutilized and waste its resources. If there are too many threads, the cluster will be saturated, and some jobs will sit idle waiting for executors to free up.
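A sketch of what that could look like, assuming CSV inputs and a hypothetical pool name. setLocalProperty is thread-local, so each thread can be assigned its own scheduler pool:

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .config('spark.scheduler.mode', 'FAIR') \
    .enableHiveSupport() \
    .getOrCreate()

def ingest(path):
    # hypothetical pool name; assigning a pool is optional under FAIR mode
    spark.sparkContext.setLocalProperty('spark.scheduler.pool', 'ingestion')
    df = spark.read.csv(path, header=True, inferSchema=True)
    return df.count()  # the action submits a separate Spark job per thread

with ThreadPoolExecutor(max_workers=4) as executor:
    counts = list(executor.map(ingest, ['/data/a.csv', '/data/b.csv']))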
Upvotes: 2
Reputation: 156
Each Spark job is independent, and there can only be one instance of SparkSession (and SparkContext) per JVM. You won't be able to create multiple session instances.
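A quick check illustrating this answer's point (the app names are arbitrary): getOrCreate() hands back the already-running session rather than building a second one.

from pyspark.sql import SparkSession

s1 = SparkSession.builder.appName('POC').getOrCreate()
s2 = SparkSession.builder.appName('other').getOrCreate()
print(s1 is s2)  # True: the same session (and the same underlying SparkContext)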
Upvotes: 0
Reputation: 6974
No, you don't create multiple Spark sessions. A Spark session should be created only once per Spark application. Spark doesn't support this, and your job might fail if you use multiple Spark sessions in the same Spark job. See SPARK-2243, where Spark closed the ticket saying it won't fix it.
If you want to load different files using dataLoader.py, there are 2 options:

1. Load and process the files sequentially. Here you load one file at a time, save it to a DataFrame, and process that DataFrame (see the sketch after this list).
2. Create a different dataLoader.py script for each file and run each Spark job in parallel. Here each Spark job gets its own SparkSession.
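A sketch of option 1, with hypothetical file paths and a placeholder process() helper standing in for your per-file logic:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('POC') \
    .enableHiveSupport() \
    .getOrCreate()

def process(df):
    # placeholder for your per-file transformation/validation logic
    df.show()

for path in ['/data/file1.csv', '/data/file2.csv']:  # hypothetical files
    process(spark.read.csv(path, header=True, inferSchema=True))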
Upvotes: 1