Vijay Innamuri

Reputation: 4372

Spark streaming creates one task per input file

I am processing a sequence of input files with Spark Streaming.

Spark Streaming creates one task per input file, and a corresponding number of partitions and output part files.

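// Reads SequenceFiles of <Text, CustomDataType> from path. The filter below
// accepts every file, and newFilesOnly = false means files already present in
// the directory are processed as well as newly arriving ones.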
JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);

For example, if there are 100 input files in a batch interval, then there will be 100 part files in the output directory.

What does each part file represent? (Is it the output of a single task?)

How can I reduce the number of output files (to 2 or 4, ...)?

Does this depend on the number of partitions?

Upvotes: 0

Views: 354

Answers (1)

Patrick McGloin

Reputation: 2234

Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
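For example, here is a minimal sketch (not part of the original answer) of how the stream from the question could be coalesced to 4 partitions, using transformToPair to apply the RDD-level coalesce to each batch; the names myRDD, Text, and CustomDataType are taken from the question, and the variable name coalesced is made up:

import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Coalesce each batch RDD down to 4 partitions, so each batch interval
// writes at most 4 part files instead of one per input file.
JavaPairDStream<Text, CustomDataType> coalesced =
    myRDD.transformToPair(
        new Function<JavaPairRDD<Text, CustomDataType>, JavaPairRDD<Text, CustomDataType>>() {
          @Override
          public JavaPairRDD<Text, CustomDataType> call(JavaPairRDD<Text, CustomDataType> rdd) {
            // coalesce(4) merges partitions without a full shuffle;
            // rdd.repartition(4) would shuffle to balance the data evenly.
            return rdd.coalesce(4);
          }
        });

Whatever save operation is applied to coalesced afterwards should then write one part file per remaining partition in each batch.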

https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations

Upvotes: 0
