Reputation: 2196
I'm a newbie with Spark and need parallelizePairs()
(I'm working in Java).
First, I start my driver with:
SparkSession spark = SparkSession
.builder()
.appName("My App")
.config("driver", "org.postgresql.Driver")
.getOrCreate();
But spark
doesn't have the function I need; the context I get
through spark.sparkContext() only offers parallelize().
Now I'm tempted to add
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("My App");
JavaSparkContext context = new JavaSparkContext(sparkConf);
This way, context has the function I need, but I'm very confused here.
First, I never needed a JavaSparkContext
before, because I run with spark-submit
and set the master address there.
Second, why is spark.sparkContext()
not the same as JavaSparkContext,
and how do I get it from the SparkSession?
If I'm passing the master on the command line, must I also set sparkConf.setMaster( '<master-address-again>' )?
I already read this: How to create SparkSession from existing SparkContext and understood the problem, but I really need the builder way because I need to pass .config("driver", "org.postgresql.Driver")
to it.
Please shed some light here...
EDIT
Dataset<Row> graphDatabaseTable = spark.read()
.format("jdbc")
.option("url", "jdbc:postgresql://192.168.25.103:5432/graphx")
.option("dbtable", "public.select_graphs")
.option("user", "postgres")
.option("password", "admin")
.option("driver", "org.postgresql.Driver")
.load();
SQLContext graphDatabaseContext = graphDatabaseTable.sqlContext();
graphDatabaseTable.createOrReplaceTempView("select_graphs");
String sql = "select * from select_graphs where parameter_id = " + indexParameter;
Dataset<Row> graphs = graphDatabaseContext.sql(sql);
Upvotes: 2
Views: 3257
Reputation: 39
SparkSession spark = SparkSession
    .builder()
    .appName("My App")
    .config("driver", "org.postgresql.Driver")
    .getOrCreate();

// Note: this returns the Scala SparkContext, not a JavaSparkContext
SparkContext sparkContext = spark.sparkContext();
Upvotes: 0
Reputation: 35249
Initialize JavaSparkContext
using the existing SparkContext:
JavaSparkContext context = new JavaSparkContext(spark.sparkContext());
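That context then exposes parallelizePairs(). A minimal sketch (the key/value data is made up purely for illustration; it assumes java.util.Arrays, java.util.List, scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD are imported):

// Example pairs, purely for illustration
List<Tuple2<Integer, String>> pairs = Arrays.asList(
    new Tuple2<>(1, "a"),
    new Tuple2<>(2, "b"));

// parallelizePairs is available on JavaSparkContext, not on the Scala SparkContext
JavaPairRDD<Integer, String> pairRDD = context.parallelizePairs(pairs);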
why is spark.sparkContext() not the same as JavaSparkContext, and how do I get it from the SparkSession
In short, because Scala is a much richer language than Java, and JavaSparkContext
is a convenience wrapper designed to get around some Java limitations. At the same time, the RDD API is being moved to the side.
If I'm passing the master on the command line, must I also set sparkConf.setMaster( '<master-address-again>' )?
No. Precedence is (from highest to lowest): explicit configuration in the application (SparkConf
and SparkContext
options), spark-submit command-line arguments, then the configuration files. The master passed to spark-submit is picked up as long as you don't override it in code.
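So, assuming the job is launched with something like spark-submit --master <master-address> ..., a sketch of a driver that leaves the master to the command line:

// No setMaster() here; the master comes from spark-submit (or spark-defaults.conf)
SparkSession spark = SparkSession
    .builder()
    .appName("My App")
    .getOrCreate();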
but I really need the builder way because I need to pass .config("driver", "org.postgresql.Driver") to it
It doesn't look right. The driver
option is used by DataFrameWriter
and DataFrameReader,
so it should be passed there, not to the session builder.
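In other words (a minimal sketch with placeholder connection details, mirroring the reader setup from your edit):

// Pass the JDBC driver to the reader, not to SparkSession.builder()
Dataset<Row> table = spark.read()
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.select_graphs")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .load();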
Upvotes: 4