Lingxi

Reputation: 14967

Is there any Apache Spark counterpart similar to Hadoop Streaming?

I have some highly customized processing logic that I want to implement in C++. Hadoop Streaming enables me to integrate the C++-coded logic into the MapReduce processing pipeline. I'm wondering whether I can do the same with Apache Spark.

Upvotes: 0

Views: 167

Answers (1)

Alper t. Turker

Reputation: 35229

The closest (though not exactly equivalent) solution is the RDD.pipe method:

Return an RDD created by piping elements to a forked external process. The resulting RDD is computed by executing the given process once per partition. All elements of each input partition are written to a process's stdin as lines of input separated by a newline. The resulting partition consists of the process's stdout output, with each line of stdout resulting in one element of the output partition. A process is invoked even for empty partitions.

The print behavior can be customized by providing two functions.

The Spark test suite provides a number of usage examples.
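For illustration, here is a minimal sketch of piping an RDD through an external binary. The executable name my_cpp_filter is hypothetical: it stands in for any C++ program that reads lines from stdin and writes transformed lines to stdout, and it must be available on every executor (for example shipped via --files or SparkContext.addFile).

import org.apache.spark.sql.SparkSession

object PipeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipeExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Each element of a partition is written to the external process's stdin
    // as one line; each line of the process's stdout becomes one element of
    // the resulting RDD.
    val input = sc.parallelize(Seq("1", "2", "3", "4"), numSlices = 2)

    // "./my_cpp_filter" is a placeholder for your C++ executable.
    val piped = input.pipe("./my_cpp_filter")

    piped.collect().foreach(println)

    spark.stop()
  }
}

Note that the process is forked once per partition, so any per-record startup cost in the C++ program is amortized across the whole partition.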

Upvotes: 1
