Iterator516

Reputation: 287

In PySpark, what's the difference between SparkSession and the Spark-CSV module from Databricks for importing CSV files?

I know 2 ways to import a CSV file in PySpark:

1) I can use SparkSession. Here is my full code in Jupyter Notebook.

from pyspark import SparkContext
sc = SparkContext()

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Spark Session 1').getOrCreate()

df = spark.read.csv('mtcars.csv', header = True)

2) I can use the Spark-CSV module from Databricks.

from pyspark import SparkContext
sc = SparkContext()

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferSchema='true').load('mtcars.csv')

1) What are the advantages of SparkSession over Spark-CSV?

2) What are the advantages of Spark-CSV over SparkSession?

3) If SparkSession is perfectly capable of importing CSV files, why did Databricks invent the Spark-CSV module?

Upvotes: 2

Views: 441

Answers (1)

maogautam

Reputation: 318

Let me answer the 3rd question first: since Spark 2.0.0, CSV support is built into Spark itself. In older versions of Spark, we had to use the spark-csv library. Databricks created spark-csv in the early days (Spark 1.3+), before built-in support existed.

To address your 1st and 2nd questions: it's essentially a Spark 1.6 vs. Spark 2.0+ comparison. If you use SparkSession, you get everything spark-csv provides plus the Spark 2.0 features. If you stick with spark-csv, you lose those newer features.

Hope this helps.

Upvotes: 2
