Reputation: 87
I am reading a CSV file with the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[2]") \
.getOrCreate()
Now there are four different options to read it:
1. df = spark.read.load("/..../xyz.csv")
2. df = spark.read.csv("/..../xyz.csv")
3. df = spark.read.format('csv').load("/..../xyz.csv")
4. df = spark.read.option().csv("/..../xyz.csv")
Which option should I use?
EDIT:
Also, both inferSchema="true" and inferSchema=True are working. Can we blindly use either one?
Upvotes: 5
Views: 5522
Reputation: 191743
2 and 3 are equivalent. 3 allows an additional option(key, value) function (see 4, or spark.read.format('csv').option(...).load()) that lets you, for example, skip a header row or set a delimiter other than comma.
def load(self, path=None, format=None, schema=None, **options):
"""Loads data from a data source and returns it as a :class:`DataFrame`.
:param path: optional string or a list of string for file-system backed data sources.
:param format: optional string for format of the data source. Default to 'parquet'.
:param schema: optional :class:`pyspark.sql.types.StructType` for the input schema
or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
:param options: all other string options
1 does not parse CSV; it uses Parquet as the default format.
I would suggest inferSchema=True to prevent typos in the string value.
Upvotes: 8
Reputation: 624
2 is an alias for 3; 1 reads Parquet files by default.
For example, spark.read.csv() just calls .format("csv").load(path):
@scala.annotation.varargs
def csv(paths: String*): DataFrame = format("csv").load(paths : _*)
It doesn't matter which one you use (2, 3, or 4). As I said, 1 reads Parquet by default.
Upvotes: 5