Reputation: 4372
I am processing a sequence of input files with Spark Streaming.
Spark Streaming creates one task per input file, with a corresponding number of partitions and output part files.
    JavaPairInputDStream<Text, CustomDataType> myRDD =
        jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
            new Function<Path, Boolean>() {
                @Override
                public Boolean call(Path v1) throws Exception {
                    return Boolean.TRUE;
                }
            }, false);
For example, if there are 100 input files in a batch interval, there will be 100 part files in the output directory.
What does each part file represent? (The output of a single task?)
How can I reduce the number of output files (to 2 or 4, say)?
Does this depend on the number of partitions?
Upvotes: 0
Views: 354
Reputation: 2234
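Each part file corresponds to one RDD partition, and each partition is written by one task. To reduce the number of output files, reduce the number of partitions before writing: call repartition or coalesce with the number of partitions you want. repartition always shuffles the data, while coalesce can lower the partition count without a full shuffle. On a DStream, repartition is available directly; coalesce is an RDD method, so apply it to each batch RDD through transform.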
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
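To make the partition-to-file relationship concrete, here is a toy model in plain Java. It is not Spark code and does not use the Spark API: the class name PartFileModel and both helper methods are made up for illustration, and the contiguous grouping in coalesce is an assumption of this sketch, not how Spark actually places partitions. It only shows that the number of part files follows the number of partitions, and that merging partitions keeps every record while producing fewer files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical toy model, NOT Spark: illustrates "one part file per partition".
class PartFileModel {

    // One output file name per partition, zero-padded like Hadoop-style part files.
    static List<String> partFileNames(int numPartitions) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            names.add(String.format(Locale.ROOT, "part-%05d", i));
        }
        return names;
    }

    // Merge existing partitions into at most `target` groups without splitting
    // any source partition. Contiguous grouping is an assumption of this sketch.
    static <T> List<List<T>> coalesce(List<List<T>> partitions, int target) {
        if (target < 1) {
            throw new IllegalArgumentException("target must be >= 1");
        }
        int n = Math.min(target, partitions.size());
        List<List<T>> merged = new ArrayList<>();
        for (int g = 0; g < n; g++) {
            merged.add(new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            // Source partition i goes to group floor(i * n / size).
            int group = (int) ((long) i * n / partitions.size());
            merged.get(group).addAll(partitions.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        // 100 input files in one batch interval, modelled as 100 partitions.
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            List<String> p = new ArrayList<>();
            p.add("record-from-file-" + i);
            partitions.add(p);
        }
        // Without merging: as many part files as partitions.
        System.out.println(partFileNames(partitions.size()).size());

        // After merging down to 4 partitions: 4 part files, same records.
        List<List<String>> merged = coalesce(partitions, 4);
        System.out.println(partFileNames(merged.size()));
    }
}
```

In the model, asking for more groups than there are partitions leaves the count unchanged, which mirrors the point that coalesce is meant for lowering the partition count; raising it in Spark requires a shuffle (repartition).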
Upvotes: 0