nick.katsip
nick.katsip

Reputation: 868

Spark integrated with Hadoop InputFormat confusion

I am currently trying to use custom InputSplit and RecordReader with Apache Spark's SparkContext hadoopRDD() function.

My question is the following:

Does the value returned by InpuSplit.getLenght() and/or RecordReader.getProgress() affect the execution of a map() function in the Spark Runtime?

I am asking because I have used these two custom classes on Apache Hadoop and they work as intended. However, in Spark, I see that new InputSplit objects are generated during Runtime, which is something I do not want my code to do. To be more precise:

In the beginning of the execution, I see in my log files that the correct number of InputSplit objects are generated (let us say only 1 for this example). In turn, a RecordReader object associated to that split, is generated and starts fetching records. At some point, I get a message that the Job that is handling the previous InputSplit stops, and a new Job is spawned with a new InputSplit. I do not understand why this is happening? Does it have to do with the value returned by RecordReader.getProgress() method or InputSplit.getLength() method?

Also, I define the InputSplit's length to be some arbitrary large number of bytes (i.e. 1GB). Does this value affect the number of Spark Jobs that are spawned during Runtime?

Any help and/or advice is welcome?

Thank you, Nick

P.S.-1 : I apologize for posting so many questions, but Apache Spark is a new Tool with little documentation on the Hadoop-Spark integration through HadoopRDDs.

P.S.-2: I can provide more technical details if they are needed.

Upvotes: 0

Views: 294

Answers (1)

hnahak
hnahak

Reputation: 246

yes if return any value from getLength() then after reading those no. of bytes from your file, hadoop will generate a new split to read further data. If you don't want this behavior, override method InputFormat.getSplits() to return false. i.e.. you dont want it splits.

getProgress() method nothing to do with generating a new split.

Upvotes: 1

Related Questions