Vijay Innamuri

Reputation: 4372

Spark streaming creates one task per input file

I am processing a sequence of input files with Spark Streaming.

Spark Streaming creates one task per input file, and a corresponding number of partitions and output part files.

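// Reads SequenceFiles of <Text, CustomDataType> from path. The filter below
// accepts every file, and newFilesOnly = false means files already present in
// the directory are processed as well as newly arriving ones.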
JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);

For example, if there are 100 input files in a batch interval, then there will be 100 part files in the output directory.

What does each part file represent? (Is it the output of a single task?)

How can I reduce the number of output files (to 2 or 4, ...)?

Does this depend on the number of partitions?

Upvotes: 0

Views: 354

Answers (1)

Patrick McGloin

Reputation: 2234

Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
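For example, here is a minimal sketch (not part of the original answer) of how the stream from the question could be coalesced to 4 partitions, using transformToPair to apply the RDD-level coalesce to each batch; the names myRDD, Text, and CustomDataType are taken from the question, and the variable name coalesced is made up:

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Coalesce each batch RDD down to 4 partitions, so each batch interval
// writes at most 4 part files instead of one per input file.
JavaPairDStream<Text, CustomDataType> coalesced =
    myRDD.transformToPair(
        new Function<JavaPairRDD<Text, CustomDataType>, JavaPairRDD<Text, CustomDataType>>() {
          @Override
          public JavaPairRDD<Text, CustomDataType> call(JavaPairRDD<Text, CustomDataType> rdd) {
            // coalesce(4) merges partitions without a full shuffle;
            // rdd.repartition(4) would shuffle to balance the data evenly.
            return rdd.coalesce(4);
          }
        });

Whatever save operation is applied to coalesced afterwards should then write one part file per remaining partition in each batch.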

https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations

Upvotes: 0
