Reputation: 81
In the book "Spark: The Definitive Guide", Bill says that read is a transformation, and a narrow one at that.
Now if I run the Spark code below and look at the Spark UI, I see a job created:
df = spark.read.csv("path/to/file")
To my understanding, a job corresponds to an action being called. Also, if I pass in some options while reading the CSV, I see one more job in the Spark UI. So if we run, for example, the code below, there are 2 jobs in the Spark UI:
df = spark.read.option("inferSchema", "true").csv("path/to/file")
So my question is: if spark.read is a transformation, why is it creating a job?
Upvotes: 8
Views: 8572
Reputation: 18043
Spark DataFrames, via Catalyst, have some smarts built in compared to RDDs.
One of them kicks in when you specify inferSchema: because schema inference can take a long time, Spark launches a job under the hood to do the inferring. It's that simple. It's something that belongs to the optimization and performance aspect and cannot be seen as an Action or a Transformation. Another example is pivoting a DataFrame.
Upvotes: 0
Reputation: 1060
Transformations, especially read operations, can behave in two ways depending on the arguments you provide.
In the case of read.csv():
You can see the WholeStageCodegen for this in the Spark UI:
You can also see the physical plan:
For the second job, the aggregated metrics by executor in the Spark UI will look like this (the number of records read is highlighted):
Upvotes: 38