Kansingh

Reputation: 63

Spark Fundamentals

I am new to Spark... there are some basic things I am not clear about while going through the fundamentals:

Query 1. For distributed processing - can Spark work without HDFS (the Hadoop file system) on a cluster, e.g. by creating its own distributed file system, or does it require some base distributed file system to be in place as a prerequisite, like HDFS, GPFS, etc.?

Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark convert it into blocks again and redistribute them at its own level (for distributed processing), or will it just use the block distribution of the Hadoop HDFS cluster?

Query 3. Other than defining a DAG, does Spark also create partitions the way MapReduce does and shuffle them to the reducer nodes for further computation? I am confused about this: up to DAG creation it is clear that the Spark executor on each worker node loads the data blocks as an RDD in memory and applies the computation as per the DAG... but where does the part come in that partitions the data by key and moves it to the other nodes where the reducer tasks will run (just like MapReduce)? How is that done in-memory?

Upvotes: 1

Views: 830

Answers (4)

Bujuti Niranjan Reddy

Reputation: 45

Query 1: Put simply, Spark provides distributed processing through the RDD (resilient distributed dataset) abstraction, but without HDFS (or a similar system) it cannot provide distributed storage.

Query 2: No, it won't recreate them. Spark will expose every block as a partition (i.e. a reference to that block), so it can launch the YARN task on the same block.

Query 3: No idea.

Upvotes: 0

Simal Haneef

Reputation: 179

Query 1. Yes, it can work with others as well. Spark works with RDDs; if you have the corresponding RDD implemented, that's it. When you actually create an RDD by opening a file in HDFS, it inherently creates a HadoopRDD, which has the implementation for understanding HDFS. If you write your own distributed file system, you can write your own RDD implementation for it and instantiate that class, and you're done. But writing the connector RDD to your own DFS is the challenge. For more, you can look at the RDD interface in the Spark code.
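To make the HadoopRDD point concrete, here is a minimal sketch (the namenode address and path are made-up placeholders): opening a file through an hdfs:// URL goes through the Hadoop InputFormat machinery, so Spark ends up with an RDD backed by the file's HDFS blocks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-rdd-sketch"))

    // An hdfs:// path is read through the Hadoop InputFormat layer
    // (a HadoopRDD under the hood); namenode address and path are
    // placeholders for illustration only.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    println(lines.count())
    sc.stop()
  }
}
```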

Query 2. It won't recreate them; instead, by means of the Hadoop/HDFS RDD connector it knows where the blocks are. It will also try to use the same YARN nodes to run the JVM tasks that do the processing.
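To illustrate the locality point, this sketch (the path is a placeholder) prints the hosts each partition prefers; the scheduler uses this information to place tasks on the nodes that hold the corresponding HDFS blocks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-sketch"))

    // Placeholder HDFS path for illustration.
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    // Each partition reports the hosts that hold its underlying block;
    // the scheduler tries to run the task on one of those hosts.
    lines.partitions.foreach { p =>
      println(s"partition ${p.index} prefers ${lines.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}
```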

Query 3. Not sure about this

Upvotes: 0

Ravindra babu

Reputation: 38910

Query 1. For distributing processing - Can Spark work without HDFS ?

For distributed processing, Spark does not require HDFS, but it may read/write data from/to an HDFS system, and for some use cases it may write data to HDFS. For the world-record sort benchmark program, it used HDFS for sorting the data instead of doing it purely in memory.

Spark doesn't provide distributed storage. Integration with HDFS is one option for storage, but Spark can use other storage systems like Cassandra, etc. Have a look at this article for more details: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
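As a small illustration of "Spark without HDFS", the sketch below (file path and master setting are placeholders) runs Spark in local mode against a plain local file, with no distributed file system involved at all.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NoHdfsSketch {
  def main(args: Array[String]): Unit = {
    // Local master and a local file: no HDFS anywhere.
    val conf = new SparkConf().setAppName("no-hdfs-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("file:///tmp/words.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```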

Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark again convert it into blocks and redistribute them at its own level?

I agree with Daniel Darabos' response: Spark will create one partition per HDFS block.
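A quick way to check this yourself (the path is a placeholder): for a default sc.textFile read, the partition count should roughly match the number of HDFS blocks of the file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-count-sketch"))

    // Placeholder path: imagine a file stored as several 128 MB HDFS blocks.
    val lines = sc.textFile("hdfs://namenode:8020/data/big-file.txt")

    // With the default minPartitions this should roughly equal the block count.
    println(s"partitions: ${lines.getNumPartitions}")

    sc.stop()
  }
}
```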

Query 3: on shuffle

Depending on the size of the data, the shuffle will be done in-memory, or it may use disk (as in the sort benchmark mentioned above), or it may use both. Have a look at this excellent article on Spark shuffle.

Fine with this. What if you don’t have enough memory to store the whole “map” output? You might need to spill intermediate data to the disk. Parameter spark.shuffle.spill is responsible for enabling/disabling spilling, and by default spilling is enabled

The amount of memory that can be used for storing “map” outputs before spilling them to disk is “JVM Heap Size” * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, with default values it is “JVM Heap Size” * 0.2 * 0.8 = “JVM Heap Size” * 0.16.
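For reference, this is how those settings from the quoted article would be passed in via SparkConf. Note that these are Spark 1.x-era parameters (the values shown are the defaults the article describes); newer Spark releases moved to unified memory management and no longer honour them, so treat this as a sketch of the legacy behaviour.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSpillConfigSketch {
  def main(args: Array[String]): Unit = {
    // Legacy (Spark 1.x) shuffle settings from the quoted article,
    // shown with their default values; later versions replaced them
    // with unified memory management (spark.memory.*).
    val conf = new SparkConf()
      .setAppName("shuffle-spill-config-sketch")
      .set("spark.shuffle.spill", "true")            // allow spilling map output to disk
      .set("spark.shuffle.memoryFraction", "0.2")    // fraction of the heap for shuffle
      .set("spark.shuffle.safetyFraction", "0.8")    // safety margin on top of the above

    val sc = new SparkContext(conf)
    // ... run a job that shuffles, e.g. a large reduceByKey ...
    sc.stop()
  }
}
```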

Upvotes: 0

Daniel Darabos

Reputation: 27455

This would be better asked as separate questions, and question 3 is hard to understand. Anyway:

  1. No, Spark does not require a distributed file system.
  2. By default Spark will create one partition per HDFS block, and will co-locate computation with the data if possible.
  3. You're asking about shuffle. Shuffle creates blocks on the mappers that the reducers will fetch from them. The spark.shuffle.memoryFraction parameter controls how much memory to allocate to shuffle block files. (20% by default.) The spark.shuffle.spill parameter controls whether to spill shuffle blocks to local disk when the memory runs out.
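As a minimal illustration of point 3 (local mode, with toy data made up for this sketch): any key-based operation such as reduceByKey introduces a shuffle boundary, where the map-side tasks write their output partitioned by key into shuffle blocks and the reduce-side tasks fetch those blocks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("shuffle-sketch").setMaster("local[4]"))

    // Toy key/value data spread over 4 partitions.
    val pairs = sc.parallelize(
      Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)), numSlices = 4)

    // reduceByKey is a shuffle boundary: each map task writes its output
    // hash-partitioned by key into shuffle blocks, and each reduce task
    // fetches the blocks for the keys it owns before combining them.
    val sums = pairs.reduceByKey(_ + _)

    sums.collect().foreach(println)
    sc.stop()
  }
}
```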

Upvotes: 1
