Reputation: 21
In which scenarios should we prefer Spark RDDs to write a solution, and in which should we go for Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration when choosing between Spark RDDs and Spark SQL?
Upvotes: 2
Views: 1134
Reputation: 5165
RDD
An RDD is a collection of data distributed across the cluster, and it handles both unstructured and structured data. You typically process it in a functional style, with transformations such as map and filter.
DF
A DataFrame is basically a two-dimensional structure that organizes data into rows and named columns, similar to a relational table in a database. DataFrames need a schema, so they handle structured data (and semi-structured data such as JSON, once a schema has been applied or inferred).
Upvotes: 0
Reputation: 18108
I found DFs easier to use than DSs; the latter are still subject to development, imho. The comment on pyspark is indeed still relevant.
RDDs are still handy for zipWithIndex, to put ascending, contiguous sequence numbers on items.
DFs / DSs can use a columnar store (for example when cached) and have much better Catalyst (optimizer) support.
Also, many things with RDDs are painful, like a JOIN, which requires key-value pairs and turns into a multi-step join if you need to JOIN more than two tables. They are legacy. The problem is that the internet is full of legacy material, and thus of RDD jazz.
Upvotes: 1
Reputation: 707
I don't see many reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL and RDD (DataFrame == Dataset[Row]). According to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will use RDDs and lose the Spark SQL optimizations when you work with unstructured data.
Upvotes: 3