intellect_dp

Reputation: 169

Are we able to retain all the persisted RDDs if we lose the Spark context?

Let me elaborate on my question:

I am using a cluster with one master node and 3 worker nodes; the Spark context is available on my master node.

I have saved my RDDs to disk using storage level "DISK_ONLY".

When I run my Spark script, it saves some RDDs to the hard disk of a worker node. Now, if my master machine goes down, the Spark context (which runs on it) goes down as well, and all the DAG information is lost.

I then have to restart my master node to bring the Spark context up and running again.

Now the question is: will I be able to get all the saved RDDs back after this bounce (restarting the master node and the Spark context daemon), given that everything is restarted?
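Roughly, my script does something like this (a sketch; the master URL, input path and transformation are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // Placeholder setup: on the real cluster the master would be
    // e.g. spark://master:7077 rather than local[*].
    val conf = new SparkConf().setAppName("persist-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("/data/input.txt").map(_.toUpperCase) // placeholder input
    rdd.persist(StorageLevel.DISK_ONLY) // blocks written to worker-local disk
    rdd.count() // materializes the RDD; the driver tracks where each block lives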

Upvotes: 2

Views: 268

Answers (3)

Ged

Reputation: 18003

The short answer is NO. Best to fail over your Master.

Alternatively, or complementarily, you could split your jobs up using a scheduler and use Spark's bucketBy approach.
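A rough sketch of that bucketBy idea (the table name, bucket count and column are illustrative, and a bucketed saveAsTable needs a catalog, e.g. a Hive-enabled session):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketBy-demo")
      .master("local[*]")   // illustrative
      .enableHiveSupport()  // bucketed saveAsTable needs a catalog
      .getOrCreate()

    // Stand-in for real data.
    val df = spark.range(1000).withColumnRenamed("id", "userId")

    // The bucketed table lives in durable storage, so unlike persisted
    // RDD blocks it survives a driver restart.
    df.write.bucketBy(8, "userId").sortBy("userId").saveAsTable("events_bucketed")

    // A later job, after the restart, just reads it back:
    val restored = spark.table("events_bucketed")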

Upvotes: 1

mkhan

Reputation: 621

I don't think there is currently a way to restore a cached RDD after the Spark context has shut down. The component that puts and gets RDD blocks is Spark's BlockManager. It, in turn, uses a component named BlockInfoManager to keep track of RDD block metadata. When the BlockManager on a worker node shuts down, it clears the resources it was using. Among them is the BlockInfoManager, which holds the HashMap containing the RDD block info. Because that map is cleared as part of the cleanup, the next time a BlockManager is instantiated it has no record of any RDD blocks saved on that worker, so it treats those blocks as uncomputed.
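One workaround, which is not a restore of the cache itself, is to write the RDD to reliable storage before the driver goes down and reload it in the fresh context. A minimal sketch (the path and element type are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Run 1: save the RDD's data, not just cache it.
    val sc = new SparkContext(new SparkConf().setAppName("save-demo").setMaster("local[*]"))
    sc.parallelize(1 to 1000000).saveAsObjectFile("/tmp/saved-rdd") // placeholder path
    sc.stop()

    // Run 2 (after the master and context have been restarted):
    val sc2 = new SparkContext(new SparkConf().setAppName("reload-demo").setMaster("local[*]"))
    val restored = sc2.objectFile[Int]("/tmp/saved-rdd")
    println(restored.count()) // 1000000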

Upvotes: 4

deepika patel

Reputation: 116

Building on @intellect_dp's explanation: if you are using a cluster manager, for example Apache Mesos or Hadoop YARN, then you need to specify which deploy mode you want to go with, "cluster mode" or "client mode":

Deploy mode distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
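For example, with spark-submit on YARN (the class and jar names are placeholders):

    # "cluster" mode: the driver runs on a node inside the cluster,
    # so it does not die with the submitting machine.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      myapp.jar

    # "client" mode: the driver runs on the machine that ran spark-submit.
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp myapp.jar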

Upvotes: 1
