Reputation: 14967
I have some highly customized processing logic that I want to implement in C++. Hadoop Streaming enables me to integrate the C++-coded logic into the MapReduce processing pipeline. I'm wondering whether I can do the same with Apache Spark.
Upvotes: 0
Views: 167
Reputation: 35229
The closest (though not exactly equivalent) solution is the RDD.pipe method:
Return an RDD created by piping elements to a forked external process. The resulting RDD is computed by executing the given process once per partition. All elements of each input partition are written to a process's stdin as lines of input separated by a newline. The resulting partition consists of the process's stdout output, with each line of stdout resulting in one element of the output partition. A process is invoked even for empty partitions.
The print behavior can be customized by providing two functions (the printPipeContext and printRDDElement arguments of pipe).
The Spark test suite provides a number of usage examples, and a minimal sketch is shown below.
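For illustration, here is a minimal Scala sketch, assuming a C++ executable at /usr/local/bin/my_mapper (a hypothetical path and program) that reads lines from stdin and writes lines to stdout:

```scala
import org.apache.spark.sql.SparkSession

object PipeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PipeExample").getOrCreate()
    val sc = spark.sparkContext

    // Sample input; in practice this would come from a real data source.
    val input = sc.parallelize(Seq("alpha", "beta", "gamma"), numSlices = 2)

    // Each partition's elements are written to the external process's stdin,
    // one per line; each line the process writes to stdout becomes one
    // element of the resulting RDD. "/usr/local/bin/my_mapper" is a
    // hypothetical C++ binary that must exist on every worker node.
    val piped = input.pipe("/usr/local/bin/my_mapper")

    piped.collect().foreach(println)
    spark.stop()
  }
}
```

Note that the external binary has to be available on every executor, for example pre-installed on the nodes or distributed with SparkContext.addFile.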
Upvotes: 1