intellect_dp

Reputation: 169

Will Spark load data into memory if the data is 10 GB and the RAM is only 1 GB?

If I have a cluster of 5 nodes, each node having 1 GB of RAM, and my 10 GB data file is distributed across all 5 nodes, say 2 GB on each node, what happens when I trigger

val rdd = sc.textFile("filepath")

rdd.collect

Will Spark load the data into RAM? How will Spark deal with this scenario: will it straight away refuse, or will it process it?

Upvotes: 2

Views: 11196

Answers (3)

PRAFULLA KUMAR DASH

Reputation: 33

On the other hand, if we do so, performance will be impacted; we will not get the speed we want.

Also, Spark only stores results in the RDD, so the result would not be the complete data. In the worst case, if we run select * from tablename, it will return the data in chunks, as much as it can afford.

Upvotes: 0

deepika patel

Reputation: 116

Let's understand the question first. @intellect_dp, you have a cluster of 5 nodes (by "node" I mean a machine, which generally includes a hard disk, RAM, a 4-core CPU, etc.). Each node has 1 GB of RAM, and your 10 GB data file is distributed such that 2 GB of data resides on the hard disk of each node. Let's assume you are using HDFS and that your block size on each node is 2 GB.

Now let's break this down:

  • each block size = 2 GB
  • RAM size of each node = 1 GB

Due to lazy evaluation in Spark, your data is loaded into RAM and processed only when an action API gets triggered.

Here you are using "collect" as the action API. The problem now is that the RAM size is less than your block size, so if you process it with all the default configuration of Spark (1 block = 1 partition), and considering that no further node is going to be added, then it will give you an out-of-memory exception.
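To make the lazy-evaluation point concrete, here is a minimal sketch (the file path is hypothetical, and `sc` is assumed to be the SparkContext, e.g. in spark-shell); nothing is read from disk until the action runs:

```scala
// Hypothetical path; assumes `sc` is the SparkContext, e.g. in spark-shell.
val rdd = sc.textFile("hdfs:///data/bigfile.txt")   // nothing is read yet
val upper = rdd.map(_.toUpperCase)                  // still nothing: map is a lazy transformation

// Only the action triggers the actual read; with 1 GB of RAM per node and
// default partitioning, this is the step that can hit an out-of-memory error.
val result = upper.collect()
```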

Now the question: is there any way Spark can handle this kind of large data with the given hardware provisioning?

Answer: yes. First you need to set the minimum number of partitions:

val rdd = sc.textFile("filepath",n)

Here n is the minimum number of partitions per block. Since we have only 1 GB of RAM, we need each partition to be smaller than that, so let's take n = 4. With a block size of 2 GB and 4 partitions per block:

each partition size = 2 GB / 4 = 500 MB

Now Spark will process this 500 MB first and convert it into an RDD partition; when the next 500 MB chunk comes in, the first one can spill to the hard disk (provided you have set the storage level to MEMORY_AND_DISK).

In this way it will process your whole 10 GB data file with the given cluster hardware configuration, as sketched below.
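As a rough sketch of that approach (the path, the partition count, and the final actions are illustrative, not a tuned recommendation), asking textFile for more partitions and allowing spill to disk looks like this:

```scala
import org.apache.spark.storage.StorageLevel

// Illustrative: ask for at least 20 partitions so each one is roughly
// 10 GB / 20 = 500 MB, in line with the arithmetic above.
val rdd = sc.textFile("hdfs:///data/bigfile.txt", 20)

// Allow partitions that do not fit in the 1 GB of RAM to spill to local disk
// instead of failing with an out-of-memory error.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Keep the work distributed: count or write back to HDFS rather than
// collecting everything to the driver.
val lineCount = rdd.count()
rdd.saveAsTextFile("hdfs:///data/bigfile_out")
```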

Now, I personally would not recommend this hardware provisioning for such a case. It will definitely process the data, but there are a few disadvantages:

  • firstly, it will involve multiple I/O operations, making the whole process very slow.

  • secondly, if any lag occurs while reading from or writing to the hard disk, your whole job will be discarded, and you will get frustrated with such a hardware configuration. On top of that, you can never be sure that Spark will be able to process your data and return a result as the data grows.

So try to keep I/O operations to a minimum, and utilize Spark's in-memory computation power, with the addition of a few more resources, for faster performance.

Upvotes: 9

RefiPeretz

Reputation: 563

When you use collect, all the data is gathered as an array in the driver node only. From that point on, Spark's distribution and the other nodes don't play a part; you can think of it as a pure Java application running on a single machine.
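A small illustration of that point (the path is hypothetical): the value returned by collect is an ordinary local Array that lives entirely in the driver JVM's heap, and anything you do with it afterwards is plain single-machine Scala.

```scala
// Hypothetical path; after collect(), `lines` is a plain Array[String] held
// entirely in the driver's heap, so no distribution is involved any more.
val lines: Array[String] = sc.textFile("hdfs:///data/bigfile.txt").collect()
val longest = lines.maxBy(_.length)   // runs only on the driver, like any local Scala code
```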

You can set the driver's memory with spark.driver.memory and ask for 10 GB.

From that moment on, if you do not have enough memory for the array, you will probably get an OutOfMemory exception.
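As a sketch (the app name and value are illustrative), the driver memory is raised when the driver is launched; note that in client mode, such as spark-shell, it has to be passed at launch rather than through SparkConf, because the driver JVM is already running by then:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values. In client mode (e.g. spark-shell) pass the setting at
// launch instead, e.g. `spark-shell --driver-memory 10g`, because the driver
// JVM has already started before this SparkConf is read.
val conf = new SparkConf()
  .setAppName("collect-example")
  .set("spark.driver.memory", "10g")   // effective when the driver is launched afterwards, e.g. cluster mode

val sc = new SparkContext(conf)
```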

Upvotes: 1
