Nick Allen

Reputation: 1467

How to 'Pipe' Binary Data in Apache Spark

I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.

This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.

Upvotes: 6

Views: 3745

Answers (1)

Nick Allen

Reputation: 1467

The problem is not with my use of 'pipe', but that 'textFile' cannot be used to read binary data. (Doh) There are a couple of options for moving forward.

  1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)

  2. Use 'SparkContext.binaryFiles' to read in each binary file as a single record. This will impact performance, as the entire file is handled by a single task rather than being split across multiple mappers.

In my specific case, for #1 I can find only one project, from RIPE-NCC, that implements such an InputFormat for pcap. Unfortunately, it appears to support only a limited set of network protocols.
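For option #2, a minimal sketch of what the pipeline could look like. Note that 'RDD.pipe' serializes each record as a text line, which would still mangle raw bytes, so this sketch invokes the external program directly via 'subprocess' instead; the helper function name 'pipe_binary' and the binary-to-csv command are my own illustrative assumptions, not part of the Spark API:

```python
import subprocess

def pipe_binary(raw_bytes, cmd):
    # Stream raw bytes through an external command via stdin and
    # return its stdout decoded as text. This is the kind of function
    # you would call per record after SparkContext.binaryFiles(),
    # instead of RDD.pipe(), which line-serializes records as text.
    proc = subprocess.run(cmd, input=raw_bytes,
                          stdout=subprocess.PIPE, check=True)
    return proc.stdout.decode("utf-8")

# Spark side (sketch; assumes an active SparkContext `sc` and a
# hypothetical /usr/bin/binary-to-csv.sh that reads stdin):
#
# csv = (sc.binaryFiles("binary-data.dat")
#          .map(lambda kv: pipe_binary(kv[1], ["/usr/bin/binary-to-csv.sh"])))
# csv.saveAsTextFile("text-data.csv")
```

Each record from 'binaryFiles' is a (filename, bytes) pair, which is why the lambda above passes 'kv[1]' (the file contents) to the external command.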

Upvotes: 5

Related Questions