cozyss

Reputation: 1398

When to use rdd in Spark2.0?

With the new SparkSQL APIs, it seems that we don't need RDDs anymore. Since RDDs are expensive, it seems that we should avoid them. Can someone explain when it is a good time to use an RDD in Spark 2?

Upvotes: 3

Views: 1563

Answers (2)

jkschin

Reputation: 5844

TLDR: You should only use an RDD if you need fine-grained control over the physical distribution of data.

This might not be relevant to Spark 2.0 and probably applies to Spark 2.2 and later. The following is from Spark: The Definitive Guide, and I found this section of the book helpful in deciding whether or not to use an RDD:

There are basically no instances in modern Spark, for which you should be using RDDs instead of the structured APIs beyond manipulating some very raw unprocessed and unstructured data (p. 44).

If you decide that you absolutely need to use RDDs, you can refer to p. 212 in the book in the section on "When to use RDDs". Excerpt reproduced:

In general, you should not manually create RDDs unless you have a very, very specific reason for doing so. They are a much lower-level API that provides a lot of power but also lacks a lot of the optimizations that are available in the Structured APIs. For the vast majority of use cases, DataFrames will be more efficient, more stable, and more expressive than RDDs.

The most likely reason for why you'll want to use RDDs is because you need fine-grained control over the physical distribution of data (custom partitioning of data). (p. 212)
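To make the "custom partitioning of data" point concrete, here is a minimal sketch (the class name, key values, and partition count are made up for illustration) of the kind of fine-grained control over physical distribution that the RDD API exposes and the structured APIs do not:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical custom Partitioner that pins a known "hot" key to its own
// partition and spreads the remaining keys by hash.
class HotKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case "hot-key" => 0
    case other     => 1 + math.abs(other.hashCode % (numPartitions - 1))
  }
}

object CustomPartitioningExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("custom-partitioning").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("hot-key", 1), ("a", 2), ("b", 3)))

    // partitionBy is only available on pair RDDs; DataFrames offer
    // repartition(col) but not an arbitrary key-to-partition mapping.
    val partitioned = pairs.partitionBy(new HotKeyPartitioner(4))

    partitioned.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
    sc.stop()
  }
}
```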

Upvotes: 1

Alper t. Turker

Reputation: 35249

it seems that we don't need RDDs anymore

The RDD API is more general, and in fact the SQL API is built on top of the RDD API with a bunch of extensions.

Since RDDs are expensive, it seems that we should avoid them.

The RDD API is not inherently expensive. It just doesn't provide the same optimizations as the SQL API. You can still build high-performance applications on top of RDDs (see, for example, org.apache.spark.ml).
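As a rough illustration of that point, here is a minimal sketch (column names and the tiny dataset are made up) of the same aggregation written against both APIs. Both run fine; only the DataFrame version goes through the Catalyst optimizer:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = Seq(("user1", 10L), ("user2", 5L), ("user1", 7L))

    // RDD API: explicit functions, no optimizer involved
    val rddResult = spark.sparkContext
      .parallelize(events)
      .reduceByKey(_ + _)
      .collect()

    // SQL API: declarative plan, optimized by Catalyst
    val dfResult = events.toDF("user", "amount")
      .groupBy("user")
      .sum("amount")
      .collect()

    println(rddResult.mkString(", "))
    println(dfResult.mkString(", "))
    spark.stop()
  }
}
```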

Can someone explain when it is a good time to use an RDD in Spark 2?

It is opinion-based, but if you need end-to-end type safety or work a lot with types that don't have built-in encoders, the RDD API is a natural choice.
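For example, a plain class that is not a case class has no built-in encoder. A minimal sketch (the LegacyRecord class is hypothetical) of the difference: the Dataset API needs an explicit Kryo encoder and stores the object as an opaque binary column, while the RDD API works with any serializable JVM object directly:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical non-case class with no built-in encoder
class LegacyRecord(val id: Int, val payload: Map[String, Any]) extends Serializable

object NoEncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("no-encoder").master("local[*]").getOrCreate()

    val records = Seq(new LegacyRecord(1, Map("k" -> 1)), new LegacyRecord(2, Map("k" -> 2)))

    // RDD API: no encoder required, full type information preserved
    val ids = spark.sparkContext.parallelize(records).map(_.id).collect()

    // Dataset API: falls back to a Kryo-encoded binary column
    implicit val enc: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]
    val ds = spark.createDataset(records)(enc)

    println(ids.mkString(", "))
    println(ds.count())
    spark.stop()
  }
}
```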

You may also prefer RDDs when the order of execution is important (you can create your own planner rules with SQL, but it is much more effort) or when you need low-level control (like user-defined Partitioners, as in the sketch above).

Upvotes: 4
