ali raza md

Reputation: 15

Is DataFrame definition lazily evaluated?

I am new to Spark and learning it. Can someone help with the question below?

The quote in spark definitive regarding dataframe definition is "In general, Spark will fail only at job execution time rather than DataFrame definition time—even if, for example, we point to a file that does not exist. This is due to lazy evaluation,"

So I guess spark.read.format().load() is the DataFrame definition. On top of the created DataFrame we apply transformations and actions, and, if I am not wrong, load is part of the read API and not a transformation.

I passed a file that does not exist to load, which I take to be the DataFrame definition, but I got the error below. According to the book it should not fail, right? I am surely missing something. Can someone help with this?

df = (
    spark.read.format('csv')
    .option('header', 'true')
    .option('inferschema', 'true')
    .load('/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv')
)

Error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Path does not exist: /spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv;' 

Why does the DataFrame definition consult Hadoop metadata if it is lazily evaluated?

Upvotes: 1

Views: 3260

Answers (2)

Gaurang Shah

Reputation: 12950

Spark uses lazy evaluation. However, that doesn't mean it can't verify whether the file exists while loading it.

Lazy evaluation applies to operations on the DataFrame object, and in order to create the DataFrame object Spark first needs to check whether the file exists.

Check the following code.

@scala.annotation.varargs
  def load(paths: String*): DataFrame = {
    if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
      throw new AnalysisException("Hive data source can only be used with tables, you can not " +
        "read files of Hive data source directly.")
    }

    DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
      val catalogManager = sparkSession.sessionState.catalogManager
      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
        source = provider, conf = sparkSession.sessionState.conf)
      val pathsOption = if (paths.isEmpty) {
        None
      } else {
        val objectMapper = new ObjectMapper()
        Some("paths" -> objectMapper.writeValueAsString(paths.toArray))
      }
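
To make the distinction concrete, here is a minimal toy model in plain Python. It is not Spark's implementation: `ToyReader`, `ToyFrame`, and `ToyAnalysisError` are hypothetical names invented for this sketch. It only illustrates the idea that configuring a reader touches nothing, `load` checks the path eagerly, and the data itself is read only when an action runs.

```python
import os
import tempfile


class ToyAnalysisError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException."""


class ToyFrame:
    """Holds a path but reads nothing until collect() is called."""

    def __init__(self, path):
        self.path = path
        self.rows_read = 0  # stays 0 until an action runs

    def collect(self):
        # The "action": only now is the file content actually read.
        with open(self.path) as handle:
            rows = [line.rstrip("\n") for line in handle]
        self.rows_read = len(rows)
        return rows


class ToyReader:
    """Stores options without side effects; load() validates the path eagerly."""

    def __init__(self):
        self.options = {}

    def option(self, key, value):
        self.options[key] = value  # nothing touches the file system here
        return self

    def load(self, path):
        # Eager existence check, similar in spirit to the error in the question.
        if not os.path.exists(path):
            raise ToyAnalysisError(f"Path does not exist: {path}")
        return ToyFrame(path)


# Configuring the reader never fails, even though no file has been named yet.
reader = ToyReader().option("header", "true")

try:
    reader.load("/no/such/dir/2011-12-19.csv")
    missing_error = None
except ToyAnalysisError as exc:
    missing_error = str(exc)

# With a real file, load() succeeds but still does not read any rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n")

frame = reader.load(tmp.name)
rows_before = frame.rows_read  # load() only checked that the path exists
rows = frame.collect()         # the action is what reads the data
os.remove(tmp.name)
```

In this sketch the missing path fails at `load`, before any action, while the rows of the existing file are only read inside `collect`.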

Upvotes: 1

Ram Ghadiyaram

Reputation: 29237

Up to this point the read is only defined and a reader object is instantiated; nothing has been loaded yet.

scala> spark.read.format("csv").option("header",true).option("inferschema",true)
res2: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@7aead157

When you actually call load:

res2.load("/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv")

and the file doesn't exist, that is execution time: Spark has to check the data source before it can load the data from the CSV.

To create the DataFrame, Spark checks the Hadoop metadata, i.e. it asks HDFS whether this file exists.

It doesn't, so you get:

org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://203-249-241:8020/spark_df_data/Spark-The-Definitive-Guide/data/retail-data/by-day/2011-12-19.csv

In general:

1) Definition time: the RDD/DataFrame lineage is created but not executed.
2) Execution time: load is actually executed.

See the flow below to understand this better.

[image: flow diagram]

Conclusion: any transformation (definition time, in your terms) will not be executed until an action is called (execution time, in your terms).
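
The conclusion above can be sketched with a small toy lineage in plain Python. This is only an illustration under my own assumptions, not how Spark is implemented: `LazyFrame` and the helper `double` are hypothetical names. Transformations just record a step and return a new plan, and the recorded functions run only when the action `collect` is called.

```python
class LazyFrame:
    """Toy lineage: records transformations and runs them only in collect()."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: extend the lineage, do not touch the data.
        return LazyFrame(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: extend the lineage, do not touch the data.
        return LazyFrame(self._source, self._steps + (("map", func),))

    def collect(self):
        # Action: replay every recorded step over the source rows.
        rows = list(self._source)
        for kind, func in self._steps:
            if kind == "filter":
                rows = [row for row in rows if func(row)]
            else:
                rows = [func(row) for row in rows]
        return rows


calls = []  # records every value that double() is actually applied to


def double(value):
    calls.append(value)
    return value * 2


# Definition time: the plan is built, but double() has not run yet.
plan = LazyFrame([1, 2, 3, 4]).filter(lambda value: value % 2 == 0).map(double)
calls_before_action = len(calls)

# Execution time: the action triggers the whole lineage.
result = plan.collect()
```

Here `calls` stays empty until `collect` runs, which mirrors the point that defining transformations does no work on its own.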

Upvotes: 0
