Mathieu Longtin

Reputation: 16700

How to keep Spark from splitting text files

When using sqlContext.load to load multiple text files, how do you keep Spark from splitting each file into multiple partitions? This isn't a problem with gzip'd files; I would like the same behavior for regular text files.

sc.wholeTextFiles would work, except that reading an entire 100MB file somehow requires 3GB of memory, so I would rather use some sort of streaming, since we will sometimes need to read much larger files.
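
For reference, a minimal sketch of the wholeTextFiles approach described above (the path and line handling are illustrative):

// One record per file: RDD[(fileName, fileContents)]. The entire file is
// materialized as a single String, which is where the memory blow-up comes from.
val files = sc.wholeTextFiles("/data/input/*.txt")
val lines = files.flatMap { case (_, contents) => contents.split("\n") }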

Upvotes: 2

Views: 354

Answers (1)

Hamel Kothari

Reputation: 737

Splittability is a feature of your InputFormat. TextInputFormat's splittability is conditional, depending on the source: plain text and some compressed formats can be split, but gzip is fundamentally not splittable.

To get the behavior you want, you can extend TextInputFormat as your own NonSplittingTextInputFormat and override the isSplitable method to always return false. Then you can load your files with code similar to the way sc.textFile is implemented:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// A TextInputFormat that never splits files: each file becomes exactly
// one split, and therefore one partition.
class NonSplittingTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

sc.hadoopFile(path, classOf[NonSplittingTextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString)
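
As a quick sanity check (the path here is just an example), the partition count should now equal the number of input files:

// With splitting disabled, each input file yields exactly one split/partition.
val rdd = sc.hadoopFile("/data/input/*.txt", classOf[NonSplittingTextInputFormat],
  classOf[LongWritable], classOf[Text]).map(_._2.toString)
println(rdd.partitions.length)  // expect one partition per input file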

Upvotes: 3