Using RDD transformations and converting to a Dataset before an action vs. using a Dataset and its API

Question

Consider the two scenarios:

A) I have an RDD, call various RDD transformations on it, and create a Dataset from it before any action runs.

B) I create a Dataset at the very beginning and call various Dataset methods on it.

Question: If the two scenarios produce the same logical outcome (one uses RDD transformations and converts to a Dataset right before an action, the other uses a Dataset and its transformations throughout), do both scenarios go through the same optimizations?

Assaf Mendelson · Accepted Answer

No, they do not.

When you work with an RDD and RDD transformations, no optimization is done. Only when you convert it to a Dataset at the end is the data converted to the Tungsten-based representation (which takes less memory and doesn't need to go through garbage collection).
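As a rough illustration of scenario A, here is a minimal Scala sketch. The `Person` case class, the sample data, and the local `SparkSession` are assumptions made up for this example and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object RddThenDataset {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .appName("rdd-then-dataset")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(
      Seq(Person("Ann", 31), Person("Bob", 17), Person("Cid", 45)))

    // RDD transformations: plain JVM functions over Person objects.
    // Spark runs them as written; Catalyst never sees them.
    val adults = rdd
      .filter(p => p.age >= 18)
      .map(p => p.copy(name = p.name.toUpperCase))

    // Only at this point are the objects encoded into the internal
    // (Tungsten) row format.
    val ds = adults.toDS()

    // Print the plans; the RDD work above should show up only as a
    // scan of an existing RDD, not as optimizable operators.
    ds.explain(true)
    ds.show()

    spark.stop()
  }
}
```

Running `explain(true)` on the result is a quick way to check what the optimizer can actually see: everything done on the RDD side is already fixed by the time the Dataset exists.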

When you use a Dataset from the beginning, it uses the Tungsten-based memory representation from the start. This means it takes less memory, shuffles are smaller and faster, and the GC overhead is avoided (although conversion from the internal representation to the case class and back occurs any time typed operations are used). If you use DataFrame operations on the Dataset, it may also take advantage of code generation and Catalyst optimizations.
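To make the typed vs. DataFrame-style distinction concrete, here is a minimal Scala sketch of scenario B. Again, the `Person` case class, the sample data, and the local `SparkSession` are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object DatasetFromStart {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to try this out.
    val spark = SparkSession.builder()
      .appName("dataset-from-start")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The data is held in the internal row format from the start.
    val ds = Seq(Person("Ann", 31), Person("Bob", 17), Person("Cid", 45)).toDS()

    // Typed operation: the lambda is an opaque function, so each row is
    // converted to a Person object before the predicate is applied.
    val typed = ds.filter((p: Person) => p.age >= 18)

    // DataFrame-style operation: a column expression that Catalyst can
    // inspect, optimize, and compile with code generation.
    val untyped = ds.filter($"age" >= 18)

    // Compare the two physical plans.
    typed.explain()
    untyped.explain()

    spark.stop()
  }
}
```

Both filters return the same rows. The difference is in the plans: the typed version typically shows the predicate as an opaque function call, while the column version shows an expression on `age` that the optimizer can work with.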

See also my answer in: Do I have to explicitly use Dataframe's methods to take advantage of Dataset's optimization?
