Mathieu Longtin

Reputation: 16700

How to keep Spark from splitting text files

When using sqlContext.load to load multiple text files, how do you keep Spark from splitting each file into multiple partitions? This isn't a problem with gzip'd files; I would like the same behavior for regular text files.

sc.wholeTextFiles would work, except that reading an entire 100MB file somehow requires 3GB of memory, so I would rather use some sort of streaming, since we will sometimes need to read much larger files.
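
For reference, a minimal sketch of the wholeTextFiles approach described above (the path and line handling are illustrative):

// One record per file: RDD[(fileName, fileContents)]. The entire file is
// materialized as a single String, which is where the memory blow-up comes from.
val files = sc.wholeTextFiles("/data/input/*.txt")
val lines = files.flatMap { case (_, contents) => contents.split("\n") }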

Upvotes: 2

Views: 354

Answers (1)

Hamel Kothari

Reputation: 737

Splittability is a feature of your InputFormat. TextInputFormat's splittability is conditional, depending on the source: plain text and some compressed formats can be split, but gzip is fundamentally not splittable.

To get the behavior you want, you can extend TextInputFormat as your own NonSplittingTextInputFormat and override the isSplitable method to always return false. Then you can load your files with code similar to the way sc.textFile is implemented:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// A TextInputFormat that never splits files: each file becomes exactly
// one split, and therefore one partition.
class NonSplittingTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

sc.hadoopFile(path, classOf[NonSplittingTextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString)
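
As a quick sanity check (the path here is just an example), the partition count should now equal the number of input files:

// With splitting disabled, each input file yields exactly one split/partition.
val rdd = sc.hadoopFile("/data/input/*.txt", classOf[NonSplittingTextInputFormat],
  classOf[LongWritable], classOf[Text]).map(_._2.toString)
println(rdd.partitions.length)  // expect one partition per input file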

Upvotes: 3