Reputation: 16700
When using sqlContext.load
for multiple text files, how do you keep Spark from splitting each file into multiple partitions? It's not a problem with gzip'd files; I would like it to work the same way for regular text files.
sc.wholeTextFiles
would work, except that reading an entire 100MB file somehow requires 3GB of memory, so I would rather use some sort of streaming, since we sometimes need to read much larger files.
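For reference, here is roughly what I'm comparing (the paths are just placeholders):

// A plain-text file larger than one block gets split across several partitions.
val lines = sc.textFile("/data/big-file.txt")
println(lines.partitions.length)        // > 1 for a large file

// wholeTextFiles keeps one record per file, but materialises each file as a
// single in-memory String, which is where the memory usage blows up.
val files = sc.wholeTextFiles("/data/*.txt")
println(files.partitions.length)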
Upvotes: 2
Views: 354
Reputation: 737
Splittability is a feature of your InputFormat. TextInputFormat has conditional splittability depending on the source (plain text and some compressed text can be split, but gzip is fundamentally not splittable).
To get the behavior you want, you can extend TextInputFormat
as your own NonSplittingTextInputFormat
and override the isSplitable method to always return false. Then you can load your files with code similar to the way sc.textFile is implemented:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

class NonSplittingTextInputFormat extends TextInputFormat {
  // Report every file as non-splittable so each file ends up in a single partition
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}
sc.hadoopFile(path, classOf[NonSplittingTextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString)
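If you want to sanity-check the result, something like this (the path and minPartitions value are placeholders) should show one partition per input file:

// Each non-splittable file becomes exactly one split, hence one partition.
val rdd = sc.hadoopFile("/data/*.txt", classOf[NonSplittingTextInputFormat],
  classOf[LongWritable], classOf[Text], minPartitions = 1).map(_._2.toString)
println(rdd.partitions.length)   // should equal the number of input files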
Upvotes: 3