How schema is Inferring in spark?

Question

I have one CSV with the following data:

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count
United States, Romania, 1
United States, Ireland, 264
United States, India, 69

I was trying to understand how is this following 2nd snippet(Snippet 2) working? I went through the documentation of spark which says it will not read/load data until action is called on that dataFrame.

Here I have created schema with wrong types and it's printing correctly as I have not called any action yet. But in 2nd snippet, I have used inferSchema instead of the custom schema to load CSV. Now my question is how I got my correct schema type without going through the data? As I have not called any action yet! Notice that I got integer type at count

Snippet 1

val myManualSchema = new StructType(Array(
      StructField("DEST_COUNTRY_NAME", LongType, true),
      StructField("ORIGIN_COUNTRY_NAME", LongType, true),
      StructField("count", LongType, false) ))

    val csv2010 = getClass.getClassLoader.getResource("2010-summary.csv").getFile
    spark.read.format("csv")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .schema(myManualSchema)
      .load(csv2010).printSchema()

    /* OUTPUT:
    root
 |-- DEST_COUNTRY_NAME: long (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: long (nullable = true)
 |-- count: long (nullable = true)
     */

Snippet 2

spark.read.format("csv")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("inferSchema", true)
      .load(csv2010).printSchema()

    /* OUTPUT:
    root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)
     */

How schema is Inferring in spark?

Answers (1)

Related Questions