ng.newbie

Reputation: 3227

What is the memory layout of a non-HDFS RDD?

I am new to Spark and I am trying to get an intuition for how RDDs are represented in memory.

HDFS-backed RDDs are easy to reason about, since the partitioning is handled by the filesystem itself: HDFS divides a large file into blocks (and replicates them for fault tolerance), so an RDD on top of it simply maps its partitions to those blocks.

But what about an RDD that is NOT pointing to data on HDFS, say one that points to an RDBMS or to MongoDB?

A couple of questions that immediately come to mind:

  1. How is the partitioning handled for something that is not natively partitioned?
    • Does Spark handle the partitioning by itself, or does it rely on the RDD implementation to do that?
  2. How does the execution work?
    • Take, for example, an RDD that points to an RDBMS: does Spark load all of its data into memory in each executor process and then run all the transformations?
    • Or does it understand the table structure of the RDBMS and do some sort of partitioning when running the transformations? (A concrete sketch of the kind of RDD I mean follows this list.)
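
For concreteness, here is the kind of RDD I have in mind. This is only a sketch using Spark's built-in JdbcRDD; the connection URL, credentials, and table are made up for illustration, and the JDBC driver is assumed to be on the classpath:

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.{SparkConf, SparkContext}

object JdbcRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-rdd-sketch"))

    // Hypothetical database and table, for illustration only.
    val url = "jdbc:postgresql://db-host:5432/mydb"

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      // The two '?' placeholders are bound to each partition's key range.
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 8, // the caller supplies the partitioning
      mapRow = (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

    println(rows.count())
    sc.stop()
  }
}
```

Even in this sketch I had to supply the key bounds and the number of partitions myself, which is exactly the part I would like to understand in general.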

In short, I am unable to see how Spark makes the RDD a generalized abstraction over all sorts of data sources.

Upvotes: -1

Views: 33

Answers (1)

Net Worth

Reputation: 1

In Apache Spark, when dealing with Resilient Distributed Datasets (RDDs) that are not stored in the Hadoop Distributed File System (HDFS), the memory layout depends on the storage level chosen for the RDD. Spark allows various storage levels, such as MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, and MEMORY_AND_DISK_SER.

Here's a brief overview of each, followed by a short example of persisting an RDD at these levels:

  1. MEMORY_ONLY: RDD partitions are stored as deserialized Java objects in the JVM heap memory of the Spark worker nodes. This storage level provides fast access to the data but consumes more memory.

  2. MEMORY_AND_DISK: RDD partitions are stored in memory, and partitions that do not fit are spilled to disk. This storage level trades memory usage against potential disk I/O.

  3. MEMORY_ONLY_SER: RDD partitions are stored in a serialized format in the JVM heap memory. This reduces memory usage compared to MEMORY_ONLY as objects are stored in a more compact form, but it requires deserialization before processing.

  4. MEMORY_AND_DISK_SER: Similar to MEMORY_AND_DISK, but RDD partitions are stored in a serialized format. This reduces memory usage and can spill to disk if memory is insufficient, but again, it requires deserialization before processing.
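
A minimal sketch of how a storage level is chosen in practice, assuming a plain Spark application with made-up data (the RDD contents and master URL are arbitrary):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("storage-levels").setMaster("local[*]"))

    val rdd = sc.parallelize(1 to 1000000)

    // Deserialized objects on the JVM heap: fastest access, highest memory use.
    // rdd.cache() is shorthand for exactly this call.
    rdd.persist(StorageLevel.MEMORY_ONLY)

    // Persistence is lazy: the partitions are materialized and cached by the
    // first action that runs over the RDD.
    println(rdd.count())

    // An RDD's level cannot be changed in place; unpersist before re-persisting.
    rdd.unpersist()
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk

    println(rdd.count())
    sc.stop()
  }
}
```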

In all cases, Spark uses a distributed memory model in which the data is partitioned across multiple nodes of the cluster. The exact layout within each node depends on factors such as the partitioning scheme, the serialization method, and the chosen storage level. Spark manages this layout transparently, ensuring fault tolerance and efficient data processing.
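
Whatever the source, you can observe the resulting partitioning directly. A small sketch (the data and partition count are arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionInspectionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-inspection").setMaster("local[4]"))

    val rdd = sc.parallelize(1 to 12, numSlices = 4)

    // Number of partitions, regardless of where the data came from.
    println(rdd.getNumPartitions) // 4

    // glom() turns each partition into an array, exposing the per-partition layout.
    rdd.glom().collect().foreach(part => println(part.mkString(",")))
    // e.g. "1,2,3" / "4,5,6" / "7,8,9" / "10,11,12"

    sc.stop()
  }
}
```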

Upvotes: 0
