Sudha

Reputation: 339

Why can't we create an RDD using Spark session

We see that,

Spark context available as 'sc'.
Spark session available as 'spark'.

I read that a Spark session includes the Spark context, streaming context, Hive context, and so on. If so, why are we not able to create an RDD by using a Spark session instead of a Spark context?

scala> val a = sc.textFile("Sample.txt")
17/02/17 16:16:14 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
a: org.apache.spark.rdd.RDD[String] = Sample.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> val a = spark.textFile("Sample.txt")
<console>:23: error: value textFile is not a member of org.apache.spark.sql.SparkSession
       val a = spark.textFile("Sample.txt")

As shown above, sc.textFile succeeds in creating an RDD, but spark.textFile does not.

Upvotes: 12

Views: 10583

Answers (3)

ansraju

Reputation: 304

It can be created in the following way:

val a = spark.read.text("wc.txt")

This will create a DataFrame. If you want to convert it to an RDD, use a.rdd. For the Dataset API, please refer to the link below: http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/Dataset.html
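For instance, a minimal sketch run in the same spark-shell as the question (reusing Sample.txt from there); the map step is only needed because .rdd returns an RDD of Rows rather than an RDD of Strings:

val df = spark.read.text("Sample.txt")   // DataFrame with a single string column named "value"
val lines = df.rdd.map(_.getString(0))   // .rdd gives RDD[Row]; extract the text to get an RDD[String]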

Upvotes: 1

watsonic

Reputation: 3373

In Spark 2+, the Spark context is available via the Spark session, so all you need to do is:

spark.sparkContext.textFile(yourFileOrURL)

see the documentation on this access method here.

In PySpark the call looks exactly the same:

spark.sparkContext.textFile(yourFileOrURL)

see the documentation here.
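As a small sketch in the Scala shell from the question (the file name and the count are just for illustration):

val lines = spark.sparkContext.textFile("Sample.txt")   // same RDD[String] as sc.textFile("Sample.txt")
lines.count()                                           // action that actually reads the file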

Upvotes: 10

Pawan B

Reputation: 4623

In earlier versions of Spark, the Spark context was the entry point for Spark. As the RDD was the main API, it was created and manipulated using the context's APIs.

For every other API, we needed to use a different context. For streaming we needed StreamingContext, for SQL SQLContext, and for Hive HiveContext.

But as the Dataset and DataFrame APIs are becoming the new standard, Spark needed an entry point built for them. So in Spark 2.0, Spark has a new entry point for the Dataset and DataFrame APIs, called the Spark session.

SparkSession is essentially a combination of SQLContext, HiveContext and a future StreamingContext.

All the APIs available on those contexts are available on the Spark session as well. The Spark session internally has a Spark context for the actual computation.

sparkContext still contains the methods it had in previous versions.
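A rough sketch of how this looks in a standalone Spark 2.x application (the object name, app name and file name here are placeholders, not from the question):

import org.apache.spark.sql.SparkSession

object RddFromSession {
  def main(args: Array[String]): Unit = {
    // single entry point: the session wraps SQLContext/HiveContext and owns a SparkContext
    val spark = SparkSession.builder()
      .appName("rdd-from-session")
      .master("local[*]")
      .getOrCreate()

    // DataFrame/Dataset reads go through the session itself ...
    val df = spark.read.text("Sample.txt")

    // ... while RDD creation still goes through the underlying SparkContext
    val rdd = spark.sparkContext.textFile("Sample.txt")

    println(rdd.count())
    spark.stop()
  }
}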

The methods of SparkSession can be found here.

Upvotes: 8
