Reputation: 3470
I have been trying to find out why I am getting strange behavior when running a certain Spark job. The job will sometimes error out depending on where I place an action (a .show(1) call): either right after caching the DataFrame or right before writing the DataFrame back to HDFS.
There is a very similar post on SO here:
Basically the other post explains that when you read from the same HDFS directory that you are writing to, and your SaveMode is "overwrite", then you will get a java.io.FileNotFoundException.
But here I am finding that just moving where the action is placed in the program can give very different results: either the program completes or it throws this exception.
I was wondering if anyone can explain why Spark is not being consistent here?
val myDF = spark.read.format("csv")
  .option("header", "false")
  .option("delimiter", "\t")
  .schema(schema)
  .load(myPath)

// If I cache or persist it here and then run an action right after the cache, the job
// will occasionally not throw the error. This is with a completely restarted SparkSession,
// so there is no risk of another user interfering on the same JVM.
myDF.cache()
myDF.show(1)

// Just an example.
// Many different transformations are then applied...
val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)
val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// The .show(1) below is the same action call as was made above, only this one ALWAYS
// completes successfully, while the .show(1) above sometimes throws FileNotFoundException
// and sometimes completes successfully. The only thing that changes between test runs is
// which of the two actions is executed: either fourthDF.show(1) or myDF.show(1) is left
// commented out.
fourthDF.show(1)

fourthDF.write
  .mode(writeMode)
  .option("header", "false")
  .option("delimiter", "\t")
  .csv(myPath)
Upvotes: 1
Views: 1867
Reputation: 1944
Spark only materializes RDDs on demand. Most actions, such as count(), require reading all partitions of the DataFrame, but actions such as take() and first() do not require all the partitions.
In your case, show(1) only needs a single partition, so only one partition is materialized and cached. Then, when you do a count(), all partitions need to be materialized and cached, to the extent your available memory allows.
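You can observe this yourself with a minimal sketch like the one below (it assumes a local SparkSession named spark; getRDDStorageInfo is a developer API, so its output may differ slightly between Spark versions), which prints how many partitions of a cached DataFrame are actually materialized after each kind of action:
// Build a DataFrame with 8 partitions and cache it
val df = spark.range(0L, 1000000L, 1L, 8).toDF("id")
df.cache()

df.show(1)   // typically scans only the first partition
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}

df.count()   // scans every partition, so the whole DataFrame ends up cached
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}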
Upvotes: 1
Reputation: 28412
Try using count instead of show(1). I believe the issue is due to Spark trying to be smart and not loading the whole DataFrame (since show does not need everything). Running count forces Spark to load and properly cache all the data, which will hopefully make the inconsistency go away.
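In terms of the code in the question, a minimal sketch of the suggested change (reusing the same placeholder names from the question) would be:
myDF.cache()
myDF.count()   // unlike show(1), count() scans every partition, so the full
               // DataFrame is cached before myPath is later overwritten

// ... transformations and fourthDF.write as before ...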
Upvotes: 3