Manasjyoti Das

Reputation: 81

Is Spark.read.csv() an Action or Transformation

In the book "Spark: The Definitive Guide", Bill says that read is a transformation, and that it is a narrow transformation.

Now if I run the Spark code below and then look at the Spark UI, I see a job created:

    df = spark.read.csv("path/to/file")

Now, to my understanding, a job means an action was called. Also, if I pass some options while reading the CSV, I see one more job in the Spark UI. So if we run, for example, the code below, there are 2 jobs in the Spark UI:

    df = spark.read.option("inferSchema", "true").csv("path/to/file")

So my question is: if spark.read is a transformation, why is it creating a job?

Upvotes: 8

Views: 8572

Answers (2)

Ged

Reputation: 18043

Spark DataFrames, via Catalyst, have some smarts built in compared to RDDs.

One of them is schema inference: when you ask Spark to infer the schema, which can take a long time, Spark launches a job under the hood to do the inferring. It's that simple. This belongs to the optimization and performance side of things and cannot really be classed as an action or a transformation. Another example is the pivoting of a DataFrame.
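A minimal sketch of that pivot case, assuming a DataFrame df with hypothetical year, country, and amount columns:

    # pivot() is a transformation, but without an explicit list of pivot
    # values Spark runs a job up front to collect the distinct values of
    # the pivot column; the same eager, performance-driven behaviour as
    # schema inference.
    pivoted = df.groupBy("year").pivot("country").sum("amount")

    # Supplying the values explicitly avoids that eager job:
    pivoted = df.groupBy("year").pivot("country", ["US", "UK"]).sum("amount")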

Upvotes: 0

akhil pathirippilly

Reputation: 1060

Transformations, and read operations in particular, can behave in two ways depending on the arguments you provide:

  1. Lazily evaluated --> It will be performed only when an action is called
  2. Eagerly evaluated --> A job will be triggered to do some initial evaluations
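A tiny sketch of the two behaviours (df and its age column are hypothetical, the path is a placeholder):

    filtered = df.filter(df.age > 30)   # 1. lazy: no job in the Spark UI yet
    filtered.count()                    # an action: now a job runs

    # 2. eager: this line alone triggers jobs, before any action is called
    df2 = spark.read.option("inferSchema", "true").csv("path/to/file")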

In the case of read.csv():

  • If it is called without defining a schema and with inferSchema disabled, Spark determines all columns to be string types; it reads only the first line to determine the column names (if header=True, otherwise it assigns default column names) and the number of fields. Basically it performs a collect operation with limit 1 --> that's why you see the first job (a sketch follows the screenshots below)

You can see the WholeStageCodegen for this in the Spark UI:

[Screenshot: WholeStageCodegen in the Spark UI]

You can also see the physical plan:

[Screenshot: physical plan of the read]
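A minimal sketch of this first case (the file path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-csv-demo").getOrCreate()

    # No schema and no inferSchema: Spark reads just the first line to
    # get the column names and field count, so exactly one small job
    # (a collect with limit 1) shows up in the UI.
    df = spark.read.option("header", "true").csv("path/to/file")
    df.printSchema()   # every column is typed as string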

  • Now if you specify inferSchema=True, the job above is triggered first, and one more job is triggered that scans through the entire file to determine the schema --> that's why you are able to see two jobs in the Spark UI

For the second job, the aggregated metrics by executor in the Spark UI look like this (with the number of records read highlighted): [Screenshot: aggregated metrics by executor]
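A sketch of this second case (same placeholder path):

    # inferSchema=True adds a second job that scans the whole file to
    # work out the column types before any action is called.
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("path/to/file")
    df.printSchema()   # types now reflect the data, e.g. integer, double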

  • Now if you specify the schema explicitly, by providing a StructType() schema object to the 'schema' argument of read.csv(), you will see that no jobs are triggered at all. This is because we have provided the number of columns and their types explicitly; Spark's catalog stores that information and no longer needs to scan the file to get it. The schema is then validated lazily, at the time an action is called
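A sketch of this third case, with hypothetical column names:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # With an explicit schema there is nothing to scan up front, so no
    # job appears in the UI until an action is actually called.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])

    df = spark.read.option("header", "true").schema(schema).csv("path/to/file")
    df.show()   # only now does a job run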

Upvotes: 38
