Reputation: 1334
I have one CSV
with the following data:
DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States, Romania, 1
United States, Ireland, 264
United States, India, 69
I was trying to understand how is this following 2nd snippet(Snippet 2
) working? I went through the documentation of spark which says it will not read/load data until action is called on that dataFrame
.
Here I have created schema with wrong types and it's printing correctly as I have not called any action yet. But in 2nd snippet, I have used inferSchema instead of the custom schema to load CSV. Now my question is how I got my correct schema type without going through the data? As I have not called any action yet! Notice that I got integer type at count
Snippet 1
val myManualSchema = new StructType(Array(
StructField("DEST_COUNTRY_NAME", LongType, true),
StructField("ORIGIN_COUNTRY_NAME", LongType, true),
StructField("count", LongType, false) ))
val csv2010 = getClass.getClassLoader.getResource("2010-summary.csv").getFile
spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.schema(myManualSchema)
.load(csv2010).printSchema()
/* OUTPUT:
root
|-- DEST_COUNTRY_NAME: long (nullable = true)
|-- ORIGIN_COUNTRY_NAME: long (nullable = true)
|-- count: long (nullable = true)
*/
Snippet 2
spark.read.format("csv")
.option("header", "true")
.option("mode", "FAILFAST")
.option("inferSchema", true)
.load(csv2010).printSchema()
/* OUTPUT:
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: integer (nullable = true)
*/
Upvotes: 0
Views: 716
Reputation: 1316
I'm not sure how the documentation exactly states it, but, while that's guaranteed for RDDs, it's actually up to the DataFrameReader
to avoid making any reads when loading.
In practice, the internal CSV reader in Spark does read the data when the inferSchema
is set to true
: see this particular line of code which calls the aggregate()
action on the internal RDD to infer the types.
Upvotes: 1