Reputation: 287
I know 2 ways to import a CSV file in PySpark:
1) I can use SparkSession. Here is my full code in Jupyter Notebook.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Spark Session 1').getOrCreate()
df = spark.read.csv('mtcars.csv', header = True)
2) I can use the Spark-CSV module from Databricks.
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header = 'true', inferschema = 'true').load('mtcars.csv')
1) What are the advantages of SparkSession over Spark-CSV?
2) What are the advantages of Spark-CSV over SparkSession?
3) If SparkSession is perfectly capable of importing CSV files, why did Databricks invent the Spark-CSV module?
Upvotes: 2
Views: 441
Reputation: 318
Let me answer 3rd question first, since 2.0.0 spark csv is embedded. But in older version of spark we have to use spark-csv library. Databricks invented spark-csv at the early stage(1.3+).
To address your 1st and 2nd question, it's kind of spark 1.6 vs 2.0+ comparison. You will get all the feature provided by spark-csv + spark 2.0 feature if you use SparkSession. If you use spark-csv then you will loose those features.
Hope this helps.
Upvotes: 2