Kaharon

Reputation: 395

Flask + PySpark: duplicate Spark session

I am using PySpark with Flask in order to expose a web service.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from flask import Flask, jsonify
from pyspark import SparkFiles
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

app = Flask(__name__)

# DEFINE SPARK SESSION
spark = SparkSession \
    .builder \
    .appName("app") \
    .master("<master>") \
    .config("spark.cores.max", 4) \
    .config("spark.executor.memory", "6g") \
    .getOrCreate()

# LOAD THE REQUIRED FILES
hdfsUrl = "hdfs://..."  # base HDFS path, left elided here
modelEnglish = PipelineModel.load(hdfsUrl + "model-english")
ticketEnglishDf = spark.read.parquet(hdfsUrl + "ticket-df-english.parquet").cache()

modelFrench = PipelineModel.load(hdfsUrl + "model-french")
ticketFrenchDf = spark.read.parquet(hdfsUrl + "ticket-df-french.parquet").cache()

def getSuggestions(ticketId, top = 10):
    # ...

    return ...

@app.route("/suggest/<int:ticketId>")
def suggest(ticketId):
    response = {"id": ticketId, "suggestions": getSuggestions(ticketId)}
    return jsonify(response)

if __name__ == "__main__":

    app.run(debug=True, host="127.0.0.1", port=2793, threaded=True)

This works well when a request is sent to the server, but the Spark jobs are duplicated... and I don't know why.

I have already tried to create the Spark session inside the if __name__ == "__main__": block.
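For reference, a rough reconstruction of what that attempt looked like (the model loading would then also have to move inside the guard, or into a function, since it needs the session):

from flask import Flask, jsonify
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = None  # created at startup instead of at import time

@app.route("/suggest/<int:ticketId>")
def suggest(ticketId):
    # ... would use the module-level `spark` session ...
    return jsonify({"id": ticketId})

if __name__ == "__main__":
    spark = SparkSession.builder \
        .appName("app") \
        .master("<master>") \
        .getOrCreate()
    app.run(debug=True, host="127.0.0.1", port=2793, threaded=True)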

Upvotes: 2

Views: 1327

Answers (1)

Midiparse

Reputation: 4781

Spark uses RDDs, which are lazy collections. By calling RDD/DataFrame methods you are only assembling a transformation pipeline; computation is triggered once you run an action such as collect, count or write. Normally (unless you cache a collection) the whole pipeline is recomputed for every action. However, even caching does not guarantee that the collection won't be recomputed. See the documentation on RDDs and caching.
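For illustration, a minimal self-contained sketch of this behaviour (the local master, file name and column name below are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

# Transformations only build the pipeline; the filter does not run anything yet.
df = spark.read.parquet("tickets.parquet")        # placeholder path
english = df.filter(df["language"] == "english")  # placeholder column

# Each action re-runs the whole lineage (read + filter) from scratch.
english.count()   # job 1
english.count()   # job 2: the same work again

# cache() only marks the DataFrame; the next action computes it and fills the cache.
english.cache()
english.count()   # computed once more, then stored in memory
english.count()   # normally served from the cache, but re-computation is still
                  # possible if cached blocks are evicted or executors are lost

In a long-running web service this means that every request ending in an action re-runs the lineage of any DataFrame that is not (or is no longer) cached, which shows up as seemingly duplicated jobs.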

Using Spark in a server application is horribly wrong in the first place. It is a distributed data processing platform that should be used for batch jobs or streaming. Spark jobs normally write to a file or a database, and the processing can take several hours (or even days) on multiple machines to finish.

I suppose your model is the output of a Spark ML pipeline. It should be small enough to bundle with your server application and load with the usual file IO tools.
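As a rough illustration of that idea, here is a minimal sketch. It assumes, purely hypothetically, that the pipeline's last stage is a LogisticRegressionModel; the exported file name and the predict_proba helper are made up for the example. The fitted weights are exported once after training and then served with plain NumPy, without any SparkSession in the web process:

import json
import numpy as np

# Once, after training (on the Spark side): export the fitted weights.
# lr_model = pipeline_model.stages[-1]              # hypothetical last stage
# weights = {"coefficients": lr_model.coefficients.toArray().tolist(),
#            "intercept": float(lr_model.intercept)}
# with open("model-english.json", "w") as f:
#     json.dump(weights, f)

# In the Flask process: plain file IO, no SparkSession or cluster needed.
with open("model-english.json") as f:               # hypothetical exported file
    weights = json.load(f)

coef = np.array(weights["coefficients"])
intercept = weights["intercept"]

def predict_proba(features):
    """Score one already-vectorised feature vector with the exported weights."""
    z = float(np.dot(coef, features)) + intercept
    return 1.0 / (1.0 + np.exp(-z))

Whatever feature-preparation stages the pipeline contains would of course have to be reproduced on the serving side as well; that is the main cost of this approach.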

Upvotes: 2
