PatPanda

Reputation: 5000

Apache Spark: How to "cache" a dataset so it is not re-computed for next computation

Small question regarding Apache Spark please.

I have a very simple Spark job (written here in Java, but applicable to other languages):

final SparkSession     sparkSession       = SparkSession.builder().getOrCreate();
final Dataset<Row>     someVeryBigDataSet = sparkSession.read().format("org.apache.spark.sql.cassandra").options(properties).load();
final Dataset<Integer> integerDataSet     = someVeryBigDataSet.map((MapFunction<Row, Integer>) row -> someSuperComplexAndHeavyComputationThatShouldBeDoneOnlyOnceToConvertRowToInteger(row), Encoders.INT());
final Dataset<Integer> goodIntegerDataSet = integerDataSet.filter((FilterFunction<Integer>) oneInteger -> oneInteger == 0);
final Dataset<Integer> badIntegerDataSet  = integerDataSet.filter((FilterFunction<Integer>) oneInteger -> oneInteger != 0);
LOGGER.info("good integer dataset size and bad integer dataset size:\n" + goodIntegerDataSet.count() + " " + badIntegerDataSet.count());
sparkSession.stop();

The job is very simple:

  1. Extract a very big dataset from some big data table
  2. Convert each row into an integer. This uses a very heavy computation, and it should only be performed once.
  3. Separate the good integer results of step 2 from the bad ones, and display both counts

The issue is that I see the map function from step 2 being performed multiple times, for each and every row of the database.

My theory (please correct me if I am wrong) is that the conversion is computed a first time on line 3 of the snippet, inside the map function.

But on lines 4 and 5, in the two filter functions, when the counts are needed, the pipeline needs the result of step 2 all over again.

Since the map function should only be run once, how can I avoid this?

Thank you

Upvotes: 0

Views: 1603

Answers (2)

swapnil shashank

Reputation: 987

Both persist() and cache() are Spark optimization techniques used to store data. The only difference is that cache() uses a fixed default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for Datasets/DataFrames), whereas with persist() the developer can choose the storage level: in memory, on disk, or both.

#cache DF with the default storage level

df.cache()

To check whether the dataframe is cached, we can use df.is_cached or df.storageLevel.useMemory. Both return a boolean value (True or False).
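For the Java API used in the question, a rough equivalent check might be the following (a small sketch; Dataset.storageLevel() is available from Spark 2.1 onwards):

// true if integerDataSet is marked as cached with a storage level that uses memory
final boolean isCachedInMemory = integerDataSet.storageLevel().useMemory();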

#persist dataframe with default storage-level

df.persist()

#persist dataframe with MEMORY_AND_DISK_2

df.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
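For the Java API used in the question, roughly equivalent calls might look like this (a sketch, assuming the usual StorageLevel import; the exact storage level shown is just one possible choice):

import org.apache.spark.storage.StorageLevel;

// cache with the default storage level
integerDataSet.cache();

// or persist with an explicitly chosen storage level
integerDataSet.persist(StorageLevel.MEMORY_AND_DISK_2());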

Upvotes: 0

Vincent Doba

Reputation: 5068

You can use the .cache() method when creating integerDataSet:

final Dataset<Integer> integerDataSet = someVeryBigDataSet
  .map((MapFunction<Row, Integer>) row -> someSuperComplexAndHeavyComputationThatShouldBeDoneOnlyOnceToConvertRowToInteger(row), Encoders.INT())
  .cache();

It will persist your dataframe in memory (or on disk when there is not enough memory), and every time you use this dataframe afterwards, the persisted version is reused instead of being recomputed.
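Putting this together with the code from the question, the whole job could look roughly like this (a sketch reusing the question's own properties, LOGGER and conversion function; the final unpersist() is optional and simply frees the cached blocks):

final SparkSession sparkSession = SparkSession.builder().getOrCreate();

final Dataset<Integer> integerDataSet = sparkSession.read()
  .format("org.apache.spark.sql.cassandra").options(properties).load()
  .map((MapFunction<Row, Integer>) row -> someSuperComplexAndHeavyComputationThatShouldBeDoneOnlyOnceToConvertRowToInteger(row), Encoders.INT())
  .cache(); // the first action materializes the cache; later actions reuse it

final long goodCount = integerDataSet.filter((FilterFunction<Integer>) oneInteger -> oneInteger == 0).count();
final long badCount  = integerDataSet.filter((FilterFunction<Integer>) oneInteger -> oneInteger != 0).count();

LOGGER.info("good integer dataset size and bad integer dataset size:\n" + goodCount + " " + badCount);

integerDataSet.unpersist(); // release the cached blocks once both counts are done
sparkSession.stop();

Note that cache() itself is lazy: the heavy map still runs once, when the first count is executed, and the second count then reads the cached integers instead of recomputing them.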

More details on caching strategies: https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/

Upvotes: 1
