Reputation: 3835
I tried to create a standalone PySpark program that reads a CSV and stores it in a Hive table. I am having trouble configuring the SparkSession, SparkConf, and SparkContext objects. Here is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import *
conf = SparkConf().setAppName("test_import")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.config(conf=conf)
dfRaw = spark.read.csv("hdfs:/user/..../test.csv",header=False)
dfRaw.createOrReplaceTempView('tempTable')
sqlContext.sql("create table customer.temp as select * from tempTable")
And I get the error:
dfRaw = spark.read.csv("hdfs:/user/../test.csv",header=False)
AttributeError: 'Builder' object has no attribute 'read'
What is the right way to configure the SparkSession object in order to use the read.csv command? Also, can someone explain the difference between the Session, Context, and Conf objects?
Upvotes: 9
Views: 9728
Reputation: 28322
There is no need to use both SparkContext and SparkSession to initialize Spark. SparkSession is the newer, recommended entry point.
To initialize your environment, simply do:
spark = SparkSession\
.builder\
.appName("test_import")\
.getOrCreate()
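Since the question ultimately stores the data in a Hive table, you will likely also want Hive support enabled on the session. A minimal sketch, assuming your Spark build ships with Hive support:
spark = SparkSession\
.builder\
.appName("test_import")\
.enableHiveSupport()\
.getOrCreate()  # enableHiveSupport() connects the session to the Hive metastore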
You can run SQL commands by doing:
spark.sql(...)
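For example, the flow from the question can be rewritten against the spark object above; the HDFS path here is just a placeholder for the truncated one in the question:
# Read the CSV into a DataFrame (placeholder path; substitute your own)
dfRaw = spark.read.csv("hdfs:///user/example/test.csv", header=False)
# Register a temporary view and create the Hive table from it
dfRaw.createOrReplaceTempView("tempTable")
spark.sql("create table customer.temp as select * from tempTable")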
Prior to Spark 2.0.0, three separate objects were used: SparkContext, SQLContext, and HiveContext. These were used separately depending on what you wanted to do and the data types involved.
With the introduction of the Dataset/DataFrame abstractions, the SparkSession object became the main entry point to the Spark environment. It is still possible to access the other objects by first initializing a SparkSession (say, in a variable named spark) and then accessing spark.sparkContext or spark.sqlContext.
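For instance, a short sketch of reaching the underlying SparkContext through an existing SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test_import").getOrCreate()
sc = spark.sparkContext          # the underlying SparkContext
rdd = sc.parallelize([1, 2, 3])  # the low-level RDD API is still available
print(rdd.count())               # 3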
Upvotes: 10