Liam385

Reputation: 318

Deserialize an in-memory Hadoop sequence file object

PySpark has a function, sequenceFile, that allows us to read a sequence file stored in HDFS or at a local path available to all nodes.
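For context, reading an existing sequence file that way typically looks something like this (the path is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Read a sequence file from HDFS; keys and values are deserialized
    # automatically from their Writable types.
    # "hdfs:///data/output.seq" is a placeholder path for illustration.
    rdd = sc.sequenceFile("hdfs:///data/output.seq")
    print(rdd.take(5))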

However, what if I already have a bytes object in driver memory, holding a serialized sequence file, that I need to deserialize?

For example, the application I am working on (I cannot change the application logic) runs a Spark job that writes this file to a non-HDFS-compliant file system. I can then retrieve the file as an in-memory Python bytes object, which seems to contain a serialized sequence file that I should be able to deserialize in memory.

Because this object is already in memory (for reasons I cannot control), the only way I currently have to deserialize it and actually see the output (which is a JSON file) is to write it to a local file, move that file into HDFS, and then read it with the sequenceFile method (since that method only works with a file on an HDFS path or on a local path available to every node). This creates problems in the application workflow; a sketch of the workaround is shown below.
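For illustration, the workaround looks roughly like this (seq_bytes, the local path, and the HDFS path are placeholder names for this sketch):

    import subprocess
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # seq_bytes is the in-memory bytes object retrieved from the application
    # (placeholder name for this sketch).
    local_path = "/tmp/output.seq"   # hypothetical local staging path
    hdfs_path = "/tmp/output.seq"    # hypothetical HDFS destination path

    # 1. Write the bytes to a local file.
    with open(local_path, "wb") as f:
        f.write(seq_bytes)

    # 2. Copy the local file into HDFS so sequenceFile() can see it.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path], check=True)

    # 3. Read it back with Spark as an ordinary sequence file.
    rdd = sc.sequenceFile(hdfs_path)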

What I need is to deserialize this object in memory so that I can write it out as a JSON file, without having to write it locally and then put it into HDFS only to read it back in with Spark.

Is there any way in Python to take this bytes-like NullWritable object and deserialize it into a Python dictionary, or to put it back into Hadoop as something that I could actually read?


Upvotes: 1

Views: 226

Answers (1)

Matt Andruff

Reputation: 5125

Basically, you'd have to look into the sequence file code of Spark itself, apply the correct pieces to your bytes object, and convert it into an RDD so that you can then do Spark things with it, like writing it to a file.

Here's a link to get you started, but it will need some digging.
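As a rough sketch of that idea: if you can re-implement enough of Hadoop's SequenceFile reading logic in Python to pull (key, value) records out of the bytes object (parse_sequence_bytes below is hypothetical, as are the variable names and output path), the rest is ordinary Spark:

    import json
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # parse_sequence_bytes() is a hypothetical function that applies the
    # relevant pieces of Hadoop's SequenceFile format to the in-memory
    # bytes object and yields (key, value) pairs.
    records = list(parse_sequence_bytes(seq_bytes))

    # Once the records are plain Python objects, turn them into an RDD and
    # do Spark things with them, e.g. write the values out as JSON text.
    rdd = sc.parallelize(records)
    rdd.map(lambda kv: json.dumps(kv[1])).saveAsTextFile("/tmp/out_json")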

Upvotes: 0
