Reputation: 71
I am pretty new to Databricks and PySpark. I am creating a dataframe by reading a CSV file, but I am not calling any action. Still, I see two jobs running. Can someone explain why?
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("uber_data_analysis").getOrCreate()
df = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, inferSchema=True)
Upvotes: -1
Views: 95
Reputation: 18098
The point is that the question is about the Databricks environment, which I also use (here). It could well be that this optimization does not happen for HDP on-prem or Cloudera, or that it is a configuration option of such environments. However, I got tired of setting up (Hive) metastores etc. for plain-vanilla Spark, so I cannot verify that, but there is material alluding to it.
With both parameters set to False we get 1 job: path checking, partition discovery, and so on. An error is raised if the file cannot be found.
As soon as you request inferSchema=True, there is an extra job for exactly that inference, run up-front of any action.

So: with both options False you see one job, and with inferSchema=True you see two.
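A minimal sketch of that comparison (the path and option values are just illustrative); you can watch the job counter in the Spark UI after each call:

# No action is called in either case.
# With header=False and inferSchema=False, Spark only validates/lists the source: 1 job.
df_plain = spark.read.csv("/FileStore/tables/uber_data.csv", header=False, inferSchema=False)

# With inferSchema=True, Spark must scan the data to work out column types: an extra job.
df_inferred = spark.read.csv("/FileStore/tables/uber_data.csv", header=True, inferSchema=True)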
Upvotes: 2
Reputation: 17534
There will always be at least one job, which will verify some basic things like the existence of the source (file, folder, table, ...).
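A quick way to see that this check happens eagerly, using a hypothetical missing path: the error surfaces at read time, before any action is called.

# Assumed non-existent path, purely for illustration.
# This raises an AnalysisException (path does not exist) immediately at .csv(...),
# not later at an action, because the source is resolved up front.
df_missing = spark.read.csv("/FileStore/tables/does_not_exist.csv", header=True)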
As soon as you read from a catalog instead, e.g. spark.read.table('hive_metastore.default.table1'), it won't need to "read the data" to figure out the schema, as the schema is part of the catalog, so there is just that one job.
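A related sketch: when the schema is already known, whether it comes from a catalog table or is supplied explicitly, Spark does not have to scan the data up front (the table name and column names below are made up for the example).

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Catalog table: the schema comes from the metastore, so no inference pass is needed.
df_table = spark.read.table("hive_metastore.default.table1")

# Explicit schema on a CSV read also skips the inference job
# (column names and types here are hypothetical).
schema = StructType([
    StructField("trip_id", StringType(), True),
    StructField("fare", DoubleType(), True),
])
df_csv = spark.read.schema(schema).csv("/FileStore/tables/uber_data.csv", header=True)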
If you're looking under the hood for a specific reason, then update the OP; otherwise this is generally not something you should worry about unless it's actually affecting your workload.
Upvotes: 0