Amrit Nayak

Reputation: 21

When should we use Spark SQL and when should we use Spark RDDs?

In which scenarios should we prefer Spark RDDs to write a solution, and in which scenarios should we choose Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration when choosing between Spark RDDs and Spark SQL?

Upvotes: 2

Views: 1134

Answers (3)

Jim Macaulay

Reputation: 5165

RDD
An RDD is a collection of data distributed across the nodes of a cluster, and it can hold both unstructured and structured data. You typically work with it in a functional style, by applying functions such as map, filter and reduce to the data.

DF
A DataFrame is essentially a two-dimensional array of objects that organizes the data into rows and columns. It is similar to a relational table in a database. A DataFrame handles only data that has a schema, that is, structured or semi-structured data.


Upvotes: 0

Ged

Reputation: 18108

I found DFs easier to use than DSs; the latter are still under development, IMHO. The comment about PySpark is indeed still relevant.

RDDs are still handy for zipWithIndex, which puts ascending, contiguous sequence numbers on items.

DFs and DSs use a columnar storage format and benefit from the Catalyst optimizer, which RDDs do not.

Also, many things with RDDs are painful, like a JOIN, which requires (key, value) pairs and a multi-step join if you need to JOIN more than 2 tables. RDDs are legacy. The problem is that the internet is full of legacy material, and thus of RDD examples.

Upvotes: 1

ShemTov

Reputation: 707

I don't see many reasons to still use RDDs.

Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL and RDD (DataFrame == Dataset[Row]). According to the Spark documentation:

Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

The problem is that Python does not support Dataset, so you will use RDDs and lose the Spark SQL optimizations when you work with unstructured data.

Upvotes: 3
