Kashif

Reputation: 3327

What do the SparkSession appName and getOrCreate functions mean?

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("CorrelationExample") \
        .getOrCreate()

    # $example on$
    data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
            (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
            (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
            (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
    df = spark.createDataFrame(data, ["features"])

    r1 = Correlation.corr(df, "features").head()
    print("Pearson correlation matrix:\n" + str(r1[0]))

    r2 = Correlation.corr(df, "features", "spearman").head()
    print("Spearman correlation matrix:\n" + str(r2[0]))
    # $example off$

    spark.stop()

I'm running the above example code from the Spark repo at this link.

What does

spark = SparkSession \
        .builder \
        .appName("CorrelationExample") \
        .getOrCreate()

mean here? What's the purpose of appName here? I don't understand why we would ever need to give the spark session an appName.

Upvotes: 0

Views: 7184

Answers (1)

maxime G

Reputation: 1771

appName sets the application name, which you can see in the Spark UI. (It is overridden by --name <name> when you spark-submit in cluster mode.) Its main purpose is to distinguish your application from others.

getOrCreate gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in your builder.

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html
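To make the get-or-create behaviour concrete, here is a minimal pure-Python sketch of the pattern. It is a toy model, not Spark's implementation: ToySession and ToyBuilder are hypothetical names invented for illustration, and the sketch assumes that a name passed to a second builder is simply ignored once a session exists. How real Spark treats options passed to an already-existing session differs in detail and can vary by version.

```python
class ToySession:
    """Hypothetical stand-in for a Spark session; not part of PySpark."""

    _active = None  # the single shared session for this process, if any

    def __init__(self, app_name):
        self.app_name = app_name

    def stop(self):
        # Forget the shared session so the next getOrCreate() builds a new one.
        ToySession._active = None


class ToyBuilder:
    """Hypothetical builder that mimics the appName(...).getOrCreate() chain."""

    def __init__(self):
        self._options = {}

    def appName(self, name):
        self._options["app.name"] = name
        return self  # returning self is what allows method chaining

    def getOrCreate(self):
        # Create a session only if none exists; otherwise reuse the existing one.
        if ToySession._active is None:
            ToySession._active = ToySession(self._options.get("app.name", "unnamed"))
        return ToySession._active


first = ToyBuilder().appName("CorrelationExample").getOrCreate()
second = ToyBuilder().appName("SomethingElse").getOrCreate()

print(first is second)   # True: the second call reused the first session
print(second.app_name)   # CorrelationExample: the name given at creation
```

In this sketch the name only matters at the moment the session is first created, which is why it is worth setting a meaningful one up front: it is the label you would later look for among other running applications.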

Upvotes: 2
