mikera

Reputation: 106351

Efficient binary serialization for Clojure/Java

I'm looking for a way to efficiently serialize Clojure objects into a binary format, i.e. something other than the classic print-and-read text serialization.

For example, I want to be able to do something like:

(def orig-data {:name "Data Object" 
                :data (get-big-java-array) 
                :other (get-clojure-data-stuff)})

(def binary (serialize orig-data))

;; here "binary" is a raw binary form, e.g. a Java byte array
;; so it can be persisted in key/value store or sent over network etc.

;; now check it works!

(def new-data (deserialize binary))

(= new-data orig-data)
=> true

The motivation is that I have some large data structures that contain a significant amount of binary data (in Java arrays), and I want to avoid the overhead of converting these all to text and back again. In addition, I'm trying to keep the format compact in order to minimise network bandwidth usage.

Specific features I'd like to have:

What's the best / standard approach to doing this in Clojure?

Upvotes: 6

Views: 4393

Answers (4)

mpenet

Reputation: 367

Nippy is one of the best choices imho: https://github.com/ptaoussanis/nippy

Upvotes: 5

j-g-faustus

Reputation: 8999

I may be missing something here, but what's wrong with the standard Java serialization? Too slow, too big, something else?

A Clojure wrapper for plain Java serialization could be something like this:

(defn serializable? [v]
  (instance? java.io.Serializable v))

(defn serialize 
  "Serializes value, returns a byte array"
  [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [dos (java.io.ObjectOutputStream. buff)]
      (.writeObject dos v))
    (.toByteArray buff)))

(defn deserialize 
  "Accepts a byte array, returns deserialized value"
  [bytes]
  (with-open [dis (java.io.ObjectInputStream.
                   (java.io.ByteArrayInputStream. bytes))]
    (.readObject dis)))

user> (= (range 10) (deserialize (serialize (range 10))))
true

There are values that cannot be serialized, e.g. Java streams and Clojure atom/agent/future, but it should work for most plain values, including Java primitives and arrays and Clojure functions, collections and records.

Whether you actually save anything depends on the data. In my limited testing on smallish data sets, serializing to text and serializing to binary take about the same time and space.

But for the special case where the bulk of the data is arrays of Java primitives, Java serialization can be orders of magnitude faster and save a significant chunk of space. (Quick test on a laptop, 100k random bytes: serialize 0.9 ms, 100kB; text 490 ms, 700kB.)
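
A rough way to reproduce this kind of comparison is sketched below. It assumes the text form is `pr-str` applied to a vector of the bytes, which is only a guess at how the figures above were obtained; the names `data`, `binary-size` and `text-size` are made up for the illustration, and `serialize` is repeated so the snippet stands alone.

```clojure
;; Same Java-serialization wrapper as above, repeated so this runs on its own.
(defn serialize [v]
  (let [buff (java.io.ByteArrayOutputStream. 1024)]
    (with-open [oos (java.io.ObjectOutputStream. buff)]
      (.writeObject oos v))
    (.toByteArray buff)))

;; 100k random bytes in a primitive byte array.
(def data
  (let [arr (byte-array 100000)]
    (.nextBytes (java.util.Random.) arr)
    arr))

;; Size of the Java-serialized form, in bytes.
(def binary-size (alength ^bytes (serialize data)))

;; Size of an assumed text form: the bytes printed as a Clojure vector.
(def text-size (count (pr-str (vec data))))

(println "binary:" binary-size "bytes, text:" text-size "chars")

;; Crude timings; a benchmarking library would give more reliable numbers.
(time (serialize data))
(time (pr-str (vec data)))
```

Exact numbers will vary by machine and JVM, so treat the output as a ballpark figure only.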

Note that the (= new-data orig-data) test doesn't work for arrays (it delegates to Java's equals, which for arrays just tests whether it's the same object), so you may want/need to write your own equality function to test the serialization.

user> (def a (range 10))
user> (= a (range 10))
true
user> (= (into-array a) (into-array a))
false
user> (.equals (into-array a) (into-array a))
false
user> (java.util.Arrays/equals (into-array a) (into-array a))
true
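
One possible shape for such an equality function is sketched below. The names `array?` and `deep=` are invented for this example; it only descends into maps and sequential collections, does not look inside sets, and ignores the difference between records and plain maps, so treat it as a starting point for tests rather than a general solution.

```clojure
(defn array?
  "True if x is a Java array (primitive or object)."
  [x]
  (and (some? x) (.isArray ^Class (class x))))

(defn deep=
  "Like =, but compares Java arrays element by element.
  Descends into maps (by key) and sequential collections."
  [a b]
  (cond
    ;; Arrays: same array type, same length, equal elements.
    (and (array? a) (array? b))
    (and (= (class a) (class b))
         (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Maps: same keys, and deep= on each pair of values.
    (and (map? a) (map? b))
    (and (= (set (keys a)) (set (keys b)))
         (every? (fn [k] (deep= (get a k) (get b k))) (keys a)))

    ;; Vectors, lists, seqs: same length, pairwise deep=.
    (and (sequential? a) (sequential? b))
    (and (= (count a) (count b))
         (every? true? (map deep= a b)))

    ;; Everything else: ordinary Clojure equality.
    :else (= a b)))

;; Two maps holding distinct byte arrays with the same contents.
(deep= {:name "Data Object" :data (byte-array (map byte [1 2 3]))}
       {:name "Data Object" :data (byte-array (map byte [1 2 3]))})
;; => true
```

With something like this, the round-trip check becomes `(deep= new-data orig-data)` instead of `(= new-data orig-data)`.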

Upvotes: 11

amalloy

Reputation: 91837

If you don't have a schema ahead of time, serializing to text is probably your best bet. To serialize arbitrary data in general, you need to do a lot of work to preserve the object graph, and do reflection to see how to serialize everything...at least Clojure's printer can do a static, no-reflection lookup of the print-method for each item.

Conversely, if you really want an optimized wire format, you need to define a schema. I've used Thrift from Java and protobuf from Clojure: neither is loads of fun, but it's not hideously onerous if you plan in advance.
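
For the schema-less text route, a minimal sketch could look like the following. The function names `text-serialize` and `text-deserialize` are made up for the example, and it assumes the data consists only of values that print readably (maps, vectors, strings, numbers, keywords and so on). Java arrays do not print readably, so they would have to be converted first, for example with `vec`.

```clojure
(require '[clojure.edn :as edn])

(defn text-serialize
  "Prints v as text and returns the UTF-8 bytes."
  [v]
  (.getBytes ^String (pr-str v) "UTF-8"))

(defn text-deserialize
  "Reads a value back from UTF-8 bytes produced by text-serialize."
  [^bytes bs]
  (edn/read-string (String. bs "UTF-8")))

;; Round trip on plain Clojure data.
(let [data {:name "Data Object" :other [1 2.5 :k "s"]}]
  (= data (text-deserialize (text-serialize data))))
```

Using `clojure.edn/read-string` rather than `clojure.core/read-string` avoids evaluating reader macros in data that arrives over the network.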

Upvotes: 3

Nano Taboada

Reputation: 4182

Have you considered Google's protobuf? You might want to check the GitHub repository with the interface for Clojure.

Upvotes: 4
