cozyss

Reputation: 1398

When to use rdd in Spark2.0?

With the new SparkSQL APIs, it seems that we don't need RDDs anymore. Since RDDs are expensive, it seems that we should avoid them. Can someone explain when it is a good time to use an RDD in Spark 2?

Upvotes: 3

Views: 1563

Answers (2)

jkschin

Reputation: 5844

TLDR: You should only use an RDD if you need fine-grained control over the physical distribution of data.

This might not be relevant to Spark 2.0 and probably applies to Spark 2.2 and later. The following is from Spark: The Definitive Guide, and I found this section of the book helpful in deciding whether or not to use an RDD:

There are basically no instances in modern Spark, for which you should be using RDDs instead of the structured APIs beyond manipulating some very raw unprocessed and unstructured data (p. 44).

If you decide that you absolutely need to use RDDs, you can refer to p. 212 in the book in the section on "When to use RDDs". Excerpt reproduced:

In general, you should not manually create RDDs unless you have a very, very specific reason for doing so. They are a much lower-level API that provides a lot of power but also lacks a lot of the optimizations that are available in the Structured APIs. For the vast majority of use cases, DataFrames will be more efficient, more stable, and more expressive than RDDs.

The most likely reason for why you'll want to use RDDs is because you need fine-grained control over the physical distribution of data (custom partitioning of data). (p. 212)
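To make the "custom partitioning of data" point concrete, here is a minimal sketch (the class name, key values, and partition count are made up for illustration) of the kind of fine-grained control over physical distribution that the RDD API exposes and the structured APIs do not:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical custom Partitioner that pins a known "hot" key to its own
// partition and spreads the remaining keys by hash.
class HotKeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case "hot-key" => 0
    case other     => 1 + math.abs(other.hashCode % (numPartitions - 1))
  }
}

object CustomPartitioningExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("custom-partitioning").setMaster("local[*]"))

    val pairs = sc.parallelize(Seq(("hot-key", 1), ("a", 2), ("b", 3)))

    // partitionBy is only available on pair RDDs; DataFrames offer
    // repartition(col) but not an arbitrary key-to-partition mapping.
    val partitioned = pairs.partitionBy(new HotKeyPartitioner(4))

    partitioned.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }
    sc.stop()
  }
}
```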

Upvotes: 1

Alper t. Turker

Reputation: 35249

it seems that we don't need RDDs anymore

The RDD API is more general, and in fact the SQL API is built on top of the RDD API with a bunch of extensions.

Since RDDs are expensive, it seems that we should avoid them.

The RDD API is not inherently expensive. It just doesn't provide the same optimizations as the SQL API. You can still build high-performance applications on top of RDDs (see, for example, org.apache.spark.ml).
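As a rough illustration of that point, here is a minimal sketch (column names and the tiny dataset are made up) of the same aggregation written against both APIs. Both run fine; only the DataFrame version goes through the Catalyst optimizer:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val events = Seq(("user1", 10L), ("user2", 5L), ("user1", 7L))

    // RDD API: explicit functions, no optimizer involved
    val rddResult = spark.sparkContext
      .parallelize(events)
      .reduceByKey(_ + _)
      .collect()

    // SQL API: declarative plan, optimized by Catalyst
    val dfResult = events.toDF("user", "amount")
      .groupBy("user")
      .sum("amount")
      .collect()

    println(rddResult.mkString(", "))
    println(dfResult.mkString(", "))
    spark.stop()
  }
}
```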

Can someone explain when it is a good time to use an RDD in Spark 2?

It is opinion-based, but if you need end-to-end type safety or work a lot with types that don't have built-in encoders, the RDD API is a natural choice.
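For example, a plain class that is not a case class has no built-in encoder. A minimal sketch (the LegacyRecord class is hypothetical) of the difference: the Dataset API needs an explicit Kryo encoder and stores the object as an opaque binary column, while the RDD API works with any serializable JVM object directly:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical non-case class with no built-in encoder
class LegacyRecord(val id: Int, val payload: Map[String, Any]) extends Serializable

object NoEncoderExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("no-encoder").master("local[*]").getOrCreate()

    val records = Seq(new LegacyRecord(1, Map("k" -> 1)), new LegacyRecord(2, Map("k" -> 2)))

    // RDD API: no encoder required, full type information preserved
    val ids = spark.sparkContext.parallelize(records).map(_.id).collect()

    // Dataset API: falls back to a Kryo-encoded binary column
    implicit val enc: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]
    val ds = spark.createDataset(records)(enc)

    println(ids.mkString(", "))
    println(ds.count())
    spark.stop()
  }
}
```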

You may also prefer RDDs when the order of execution is important (you can create your own planner rules with SQL, but it is much more effort) or when you need low-level control (like user-defined Partitioners, as in the sketch above).

Upvotes: 4
