Reputation: 13118
There are a lot of RDD classes in Spark, as listed in the docs, and I do not understand what they are supposed to be.
Additionally I noticed that there are ParallelCollectionRDD and MapPartitionsRDD, which are not listed though they appear very often in my spark-shell as objects.
Why are there different RDDs and what are their respective purposes?
I understood from tutorials and books (e.g. "Learning Spark") that there are two types of operations on RDDs: those for RDDs holding pairs (x, y), and all the other operations. So I would expect there to be a class RDD and a class PairRDD, and that's it.
I suspect that I got it partly wrong, and that what is actually the case is this: a lot of the RDD classes could be just one RDD class, but that would make things less tidy. So instead, the developers decided to put different methods into different classes, and in order to provide those methods to any RDD class type, they use implicit conversions to coerce between the class types. I suspect this because many of the RDD class types end with "Functions" or "Actions", and the text in the respective scaladocs sounds like this.
Additionally I suspect that some of the RDD classes still are not like that, but have some more in-depth meaning (e.g. ShuffledRDD).
However - I am not sure about any of this.
Upvotes: 2
Views: 229
Reputation: 330393
First of all, roughly half of the listed classes don't extend RDD but are type classes designed to augment RDD with different methods specific to the stored type.
One common example is RDD[(T, U)], commonly known as PairRDD, which is enriched by methods provided by PairRDDFunctions, like combineByKeyWithClassTag, which is a basic building block for all byKey transformations. It is worth noting that there is no such class as PairRDD or PairwiseRDD; these names are purely informal.
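The enrichment mechanism can be illustrated without Spark at all. The sketch below is a minimal, self-contained imitation of the pattern: MyRDD and MyPairFunctions are hypothetical names standing in for Spark's RDD and PairRDDFunctions, and reduceByKey here is a toy local implementation, not Spark's.

```scala
// A minimal sketch of the type-class-style enrichment pattern (plain
// Scala, no Spark). MyRDD plays the role of RDD; MyPairFunctions plays
// the role of PairRDDFunctions.
class MyRDD[T](val data: Seq[T]) {
  def map[U](f: T => U): MyRDD[U] = new MyRDD(data.map(f))
}

// The implicit class only applies when T is a pair (K, V), so key-based
// methods become available on MyRDD[(K, V)] and on no other MyRDD.
implicit class MyPairFunctions[K, V](rdd: MyRDD[(K, V)]) {
  def reduceByKey(f: (V, V) => V): MyRDD[(K, V)] =
    new MyRDD(rdd.data.groupBy(_._1).map { case (k, vs) =>
      (k, vs.map(_._2).reduce(f))
    }.toSeq)
}

val pairs = new MyRDD(Seq(("a", 1), ("a", 2), ("b", 3)))
// Compiles only because the compiler inserts the implicit conversion:
val reduced = pairs.reduceByKey(_ + _)
```

A MyRDD[String], by contrast, would fail to compile on reduceByKey, which is exactly how Spark keeps key-based methods off non-pair RDDs without a separate PairRDD class.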
There are also a few commonly used subclasses of RDD which are not a part of the public API and as such are not listed above. Some examples worth mentioning are ParallelCollectionRDD and MapPartitionsRDD.
RDD is an abstract class which doesn't implement two important methods:

- compute, which computes the result for a given partition
- getPartitions, which returns a sequence of partitions for a given RDD
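A toy model of this contract can be written in plain Scala. SimpleRDD and SeqRDD below are hypothetical names for illustration; Spark's real RDD additionally tracks dependencies, partitioners, storage levels, and so on.

```scala
// A toy sketch of the two abstract methods and how a generic action can
// be built on top of them. Names are invented for this example.
abstract class SimpleRDD[T] {
  def getPartitions: Seq[Int]               // the indices of the partitions
  def compute(partition: Int): Iterator[T]  // the results for one partition

  // An action implemented purely in terms of the two abstract methods:
  def collect(): Seq[T] = getPartitions.flatMap(p => compute(p).toSeq)
}

// An input-source-style subclass, in the spirit of ParallelCollectionRDD:
// it distributes a local collection round-robin over numPartitions.
class SeqRDD[T](data: Seq[T], numPartitions: Int) extends SimpleRDD[T] {
  def getPartitions: Seq[Int] = 0 until numPartitions
  def compute(partition: Int): Iterator[T] =
    data.zipWithIndex
      .collect { case (x, i) if i % numPartitions == partition => x }
      .iterator
}
```

The point of the sketch is that collect never needs to know which subclass it is running on; only getPartitions and compute vary per subclass.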
In general there are two reasons to subclass RDD:

- implementing a new input source (e.g. ParallelCollectionRDD, JdbcRDD)
- providing an RDD with non-standard transformations

So to summarize:
- The RDD class provides a minimal interface for RDDs.
- Subclasses of RDD provide the internal logic required for actual computations based on external sources and/or parent RDDs. These are either private or part of the developer API and, excluding debug strings or the Spark UI, are not exposed directly to the final user.
- Type classes provide additional methods specific to the type of the values stored in an RDD, and not dependent on how it has been created.

Upvotes: 2