yhw82

Reputation: 61

Spark Streaming - processing binary data file

I'm using pyspark 1.6.0.

I have existing pyspark code to read binary data files from an AWS S3 bucket. Other Spark/Python code will parse the bits in the data to convert them into int, string, boolean, and other types. Each binary file contains one record of data.
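The bit-parsing step might look something like this. This is a minimal sketch only: the record layout (a 4-byte big-endian int, an 8-byte ASCII string, and a 1-byte boolean) is a hypothetical example, not the actual file format:

```python
import struct

def parse_record(content):
    """Parse one binary record into (int, str, bool).

    The layout here is assumed for illustration:
      bytes 0-3   : big-endian signed int
      bytes 4-11  : 8-byte ASCII string, NUL-padded
      byte  12    : boolean flag
    """
    count, = struct.unpack(">i", content[0:4])
    name = content[4:12].decode("ascii").rstrip("\x00")
    flag, = struct.unpack(">?", content[12:13])
    return count, name, flag
```

With sc.binaryFiles, this would be applied to the second element of each (filename, data) tuple, e.g. rdd.map(lambda kv: parse_record(kv[1].toArray() if hasattr(kv[1], "toArray") else kv[1])).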

In PySpark I read the binary file using: sc.binaryFiles("s3n://.......")

This works great: it gives a tuple of (filename, data). But I'm trying to find an equivalent PySpark Streaming API to read a binary file as a stream (and hopefully get the filename too, if possible).

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get this working...

Can anyone shed some light on how PySpark Streaming can read a binary data file?

Upvotes: 6

Views: 3803

Answers (2)

Marcus

Reputation: 2238

I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution: the binaryRecordsStream(&lt;path&gt;, &lt;record length&gt;) API is only for fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(&lt;path&gt;). The solution, after reading what binaryFiles() does under the covers, was to do this:

JavaPairInputDStream<String, PortableDataStream> rawAuctions = 
        sc.fileStream("s3n://<bucket>/<folder>", 
                String.class, PortableDataStream.class, StreamInputFormat.class);

Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.

Upvotes: 1

JuJoDi

Reputation: 14975

In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API.
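One possible workaround, assuming you control how the files land in the monitored directory: encode each binary record as a single base64 text line before writing, so textFileStream can pick it up, then decode on the stream side. A sketch (the directory path and the base64-per-line convention are my assumptions, not part of the question):

```python
import base64

def decode_line(line):
    """Each text line is assumed to carry one base64-encoded binary record."""
    return base64.b64decode(line.strip())

# With a StreamingContext ssc, the streaming side would then be roughly:
#   lines = ssc.textFileStream("s3n://<bucket>/<folder>")
#   records = lines.map(decode_line)
# after which the existing binary-parsing code can be mapped over records.
```

The filename is lost with this approach, so if you need it you would have to embed it in the encoded line yourself (for example as "filename,base64data").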

Upvotes: 1
