Reputation: 61
I'm using PySpark 1.6.0.
I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds a single record.
In PySpark I read the binary files using: sc.binaryFiles("s3n://.......")
This works great: it gives a tuple of (filename, data). But I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (and hopefully the filenames too, if possible).
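For context, a minimal sketch of that batch approach, where the bucket path and the parsing logic are placeholders rather than my actual code:

    from pyspark import SparkContext
    import struct

    sc = SparkContext(appName="binary-batch")

    # binaryFiles yields one (filename, bytes) pair per file,
    # since each file holds a single record
    raw = sc.binaryFiles("s3n://my-bucket/data/")  # hypothetical path

    # example parse only: a 4-byte big-endian int at the head of the record
    parsed = raw.map(lambda fd: (fd[0], struct.unpack(">i", fd[1][:4])[0]))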
I tried: binaryRecordsStream(directory, recordLength)
but I couldn't get it working.
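For reference, the attempt looked roughly like this (the directory and record length are placeholders); as far as I can tell, binaryRecordsStream chops each file into fixed-size records and carries no filenames:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="binary-stream")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Splits each new file into fixed-length byte records; filenames
    # are not preserved, so this only fits uniform-length records
    records = ssc.binaryRecordsStream("s3n://my-bucket/data/", recordLength=512)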
Can anyone shed some light on how to read binary data files with PySpark Streaming?
Upvotes: 6
Views: 3803
Reputation: 2238
I had a similar question for Java Spark, where I wanted to stream updates from S3, and there was no trivial solution: the binaryRecordsStream(<path>, <record length>) API is only for fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was this:
    // sc here is a JavaStreamingContext; StreamInputFormat is the input
    // format that binaryFiles() uses under the covers
    JavaPairInputDStream<String, PortableDataStream> rawAuctions =
        sc.fileStream("s3n://<bucket>/<folder>",
            String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.
Upvotes: 1
Reputation: 14975
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text files, you can use the textFileStream API instead.
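A minimal sketch of that workaround in PySpark, assuming the records can be written out as one text line per file (the path and batch interval are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="text-stream")
    ssc = StreamingContext(sc, 10)  # 10-second batches

    # Watches the directory for new files and reads them as UTF-8 text,
    # one DStream element per line; raw bytes are not preserved
    lines = ssc.textFileStream("s3n://my-bucket/data/")
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()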
Upvotes: 1