Davyzhu
Davyzhu

Reputation: 1109

Serialize persistent data structures in clojure

We all know that Rich uses a ideal hash tree-based method to implement the persistent data structures in Clojure. This structure enables us to manipulate the persistent data structures without copying a lot.

But it seems I cannot find the correct way to serialize this specific structure. For example given:

(def foo {:a :b :c :d})
(def bar (assoc foo :e :f))
(def bunny {:foo foo :bar bar})

My question is:

How can I serialize the bunny such that the contents of foo, i.e. :a mapping to :b and :c mapping to :d, appear only once in the serialized content? It's like dumping a memory image of the structures. It's also like serializing the "internal nodes" as well as the "leaf nodes" referenced here.

P.S. In case this is relevant, I am building a big DAG (directed acyclic graph) where we assoc quite a bit to link these nodes to those nodes, and want to serialize the DAG for later de-serialization. The expanded representation of the graph (i.e., the content one'll get when printing the DAG in repl) is unacceptably long.

Upvotes: 4

Views: 407

Answers (2)

Frank C.
Frank C.

Reputation: 8088

Davyzhu,

Few things first:

  1. The DAG, without tokenization strategy, will be as long as the DAG is. If foo is referenced 1 or more times each will be fully realized (i.e. displayed) in turn during printing.
  2. For the interchanges of the information (serialize and deserialize) it will be largely dependent on your goals. For example, if you are serializing to send it off over the wire you will either want to do it fully (like the printed representation) or you will need to encode individual data points with some identification/tokenization strategy. The latter, of course, assumes the receiving end can deserialize with understanding of the tokenization protocol.
  3. The tokenization strategy example, could use Clojure meta facilities perhaps, would require encoding unique keys for each content block reference and your DAG contains nodes where the edges are represented by the keys.

Edit:: Modified since original post to clarify as per comments but the example does not reflect the hierarchical nature of the DAG.

A contrived example:

(def node1 {:a :b :c :d})
(def node2 {:e :f})
(def dictionary {:foo node1 :bar node2})

(def DAG [:bunny [:foo :bar]])

(println DAG) ; => [:bunny [:foo :bar]]

(defn expand-dag1
  [x]
  (if (keyword? x)
    (get dictionary x x)
    x))

(println (w/postwalk expand-dag1 DAG)) ; => [:bunny [{:a :b, :c :d} {:e :f}]]

Note: Use of vectors, maps, lists, etc. to express your DAG is up to you.

Upvotes: 1

johnbakers
johnbakers

Reputation: 24750

This is one option (that will not work in Clojurescript, in case that matters) and in general may be perceived as a bad idea, but it is worth mentioning anyway.

If I understand your question, you want the foo in (def bunny {:foo foo :bar bar}) to not be "pasted" as a full copy, but rather retain a "reference" to the (def foo..) such that the original foo map is only serialized once.

  1. One technique I would consider though not necessarily encourage (and only after exhausting other options such as a reorginization of your data structure as hinted by Frank C.), is to serialize the code for bunny rather than the structure itself. Then you read the code string back in and eval it. This would only work if the structure for bunny does not change, or if it does, that you can easily build a string of the bunny map with the relevant symbols included as part of the string, rather than the contents of those symbols.

  2. But a much better idea would be to serialize your "raw" data structures only, like the maps foo and bar, then build your bunny after these are read back in -- by also serializing the structure but not the contents of bunny. I believe this is what Frank's answer is getting at.

Worth noting that if the structure of bunny does change dynamically, and you are able to create a string of symbols as suggested in 1. above, then that means you also have the tools to instead build a representation of bunny as in 2. above, which would be preferable.

Since code is data, option 1. is an example of the type of flexibility available to us as lisp programmers -- but that doesn't mean there are not better options.

Upvotes: 1

Related Questions