Why are there different RDDs and what are their respective purposes?

Question

There are a lot of RDDs in Spark; from the docs:

AsyncRDDActions
CoGroupedRDD
DoubleRDDFunctions
HadoopRDD
JdbcRDD
NewHadoopRDD
OrderedRDDFunctions
PairRDDFunctions
PartitionPruningRDD
RDD
SequenceFileRDDFunctions
ShuffledRDD
UnionRDD

and I do not understand what they are supposed to be.

Additionally I noticed that there are

ParallelCollectionRDD
MapPartitionsRDD

which are not listed though they appear very often in my spark-shell as objects.

Question

Why are there different RDDs and what are their respective purposes?

What I understood so far

I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: Those for RDDs which have pairs (x, y) and all the other operations. So I would expect to have class RDD and PairRDD and that's it.

What I suspect

I suspect that I got it partly wrong and what is actually the case is that a lot of RDD classes could be just one RDD class - but that would make things less tidy. So instead, the developers decided to put different methods into different classes and in order to provide those to any RDD class type, they use implicit to coerce between the class types. I suspect that due to the fact that many of the RDD class types end with "Functions" or "Actions" and text in the respective scaladocs sound like this.

Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).

However - I am not sure about any of this.

Why are there different RDDs and what are their respective purposes?

Question

What I understood so far

What I suspect

Answers (1)

Related Questions