Reputation: 14258
I have 4 arrays, that are ordered. I'd like to be able to combine them together into a single sorted datastructure and lazily take from that.
Is there an efficient way of doing this?
[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]
=> [1 1 1 2 2 3 3 3 3 4]
Upvotes: 4
Views: 428
Reputation: 394
Both Rulle's solution and leetwinski's second one use iterate
in quite an anamorphic manner (especially the latter). Let's define unfold
using iterate
(usually the opposite is done) and write an explicitly anamorphic solution:
(defn unfold [f s]
(->> s
(list nil)
(iterate (comp f second))
rest
(take-while some?)
(map first)))
(defn merge-sorted [s]
(->> s
(filter seq)
(unfold
(fn [s]
(if (seq s)
(loop [[[mf & mn :as m] & s] s, r ()]
(if-let [[[xf :as x] & s] s]
(let [[m x] (if (< xf mf) [x m] [m x])]
(recur (cons m s) (cons x r)))
(list mf (if mn (cons mn r) r)))))))))
Here is a version of merge-sorted
that uses reduce
instead of loop
and recur
:
(defn merge-sorted [s]
(->> s
(filter seq)
(unfold
(fn [s]
(if (seq s)
(let [[[mf & mn] r]
(reduce
(fn [[m r] x]
(if (< (first x) (first m))
[x (cons m r)]
[m (cons x r)]))
[(first s) ()]
(rest s))]
(list mf (if mn (cons mn r) r))))))))
Impressed by the efficiency of A. Webb's solution after the edit and considering this problem interesting and relatively important, I had a look at the Wikipedia articles on merge and k-way merge algorithms and at this paper. I found out that there is plenty of room for analysis/experimentation/improvement and decided to (re)implement and test several algorithms. Here they are, packed in a map, preceded by some helper functions and followed by some functions useful for testing:
(require ['clojure.core.reducers :as 'reducers])
(defn mapmap [f m]
(reduce #(update %1 %2 f) m (keys m)))
(defn unfold [f s]
(->> s
(list nil)
(iterate (comp f second))
rest
(take-while some?)
(map first)))
(defn foldh [f s]
((fn rec [v]
(f (if (> (count v) 2)
(let [h (quot (count v) 2)]
(map rec [(subvec v 0 h) (subvec v h)]))
v)))
(vec s)))
(defn fold2 [f s]
(loop [s s]
(if (nnext s)
(recur (map f (partition-all 2 s)))
(f s))))
(def merge-sorted
(merge
;direct lazy algorithms
(mapmap
(fn [[prepare choose insert]]
(fn [s]
(->> s
(filter seq)
prepare
(unfold
(fn [s]
(if (seq s)
(let [[[xf & xn] s] (choose s)]
[xf (if xn (insert s xn) s)])))))))
{:min
[identity
(fn [s]
(reduce
(fn [[x s] y]
(if (< (first x) (first y))
[x (cons y s)]
[y (cons x s)]))
[(first s) ()]
(rest s)))
conj]
:sort
[(partial sort-by first)
(juxt first rest)
(fn [s [xf :as x]]
(let [[a b] (loop [a () b (seq s)]
(if-let [[bf & bn] b]
(if (< (first bf) xf)
(recur (cons bf a) bn)
[a b])
[a b]))]
(into (cons x b) a)))]
:lsort
[(partial sort-by first)
(juxt first rest)
(fn [s [xf :as x]]
((fn rec [s]
(lazy-seq
(if-let [[sf] (seq s)]
(if (< (first sf) xf)
(cons sf (rec (rest s)))
(cons x s))
(list x))))
s))]
:heap
[(fn [s]
(let [h (java.util.PriorityQueue.
(count s)
#(< (first %1) (first %2)))]
(run! #(.add h %) s)
h))
(fn [h] [(.poll h) h])
(fn [h x] (.add h x) h)]})
;folding lazy algorithms
(mapmap
(letfn [(merge2 [s]
(lazy-seq
(if-let [[x & s] (seq (filter seq s))]
(if-let [[y] s]
((fn rec [x y]
(lazy-seq
(let [[[xf & xn] y]
(if (< (first x) (first y))
[x y]
[y x])]
(cons xf (if xn (rec xn y) y)))))
x y)
x))))]
(fn [fold] (partial fold merge2)))
{:foldl #(reduce (comp %1 list) %2)
:foldh foldh
:fold2 fold2})
;folding eager algorithms
(mapmap
(letfn [(merge2 [s]
(if-let [[x & s] (seq (filter seq s))]
(if-let [[y] s]
(loop [x x y y acc ()]
(let [[[xf & xn] y]
(if (< (first x) (first y))
[x y]
[y x])
acc (conj acc xf)]
(if xn
(recur xn y acc)
(into y acc))))
x)
()))]
(fn [fold] (partial fold merge2)))
{:efoldp #(reducers/fold 2 (comp %1 list) (comp %1 list) (vec %2))
:efoldh foldh
:efold2 fold2})))
(defn gen-inp [m n]
(->> 0
(repeat m)
(map
(comp
doall
(partial take n)
rest
(partial iterate #(+ % (rand-int 100)))))
doall))
(defn test-merge-sorted [m n & algs]
(->> (or algs (sort (keys merge-sorted)))
(map (juxt name merge-sorted))
(run!
(let [inp (gen-inp m n)]
(fn [[id alg]]
(println id)
;(java.lang.System/gc)
(try
(time (doall (alg inp)))
(catch java.lang.StackOverflowError _
(prn "Stack overflow"))))))))
The direct lazy algorithms follow a common scheme parameterized by how the following are done:
:min
is like my first solution, which calculates a minimum at each iteration.
:sort
is like Rulle's solution, which sorts the sequences initially and makes a sorted-insertion at each iteration.
:lsort
is like :sort
but with lazy insertion. It can cause stack overflow because of nested lazy sequences.
:heap
is a simple but suboptimal implementation of heap-merge using Java's PriorityQueue
s.
The folding lazy algorithms follow a common scheme parameterized by how a <=2
-ary merge is extended to arbitrary arities.
:foldl
is like A. Webb's solution before the edit, which does a left folding using reduce
. It can cause stack overflow because of nested lazy sequences.
:foldh
is an implementation of divide-and-conquer-merge, which folds by splitting in half.
:fold2
is like A. Webb's solution after the edit, which folds by splitting in pairs.
The folding eager algorithms follow a scheme like that of the lazy ones but using an eager <=2
-ary merge.
:efoldp
is a parallel implementation of divide-and-conquer-merge using clojure.core.reducers/fold
, which does various <=2
-ary merges concurrently and possibly in parallel, by "forking" every time it splits in half.
:efoldh
and :efold2
are like :foldh
and :fold2
but eager.
As a short remark on performance I would say that, for fast lazy merging, one of :foldh
, :fold2
or maybe :heap
should be chosen. As for eager merging, depending on the hardware's parallelism ability and on the input's shape, :efoldp
can be faster than the rest. For more information see the linked articles.
Upvotes: 1
Reputation: 26436
You can just write this out explicitly, for a lazy performant version.
(defn merges
([x] x)
([x y]
(cond
(empty? x) y
(empty? y) x
(< (first x) (first y))
(cons (first x) (lazy-seq (merges y (rest x))))
:else
(cons (first y) (lazy-seq (merges x (rest y))))))
([x y & more]
(apply merges
(for [[a b] (partition-all 2 (list* x y more))]
(merges a b)))))
(apply merges [[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]])
Edit: This revision merges pairwise up a binary tree for depth log number of sequences, rather than the prior linear reduction.
Upvotes: 1
Reputation: 14258
I slightly modified Rulle's answer to provide sorting on maps:
(defn sort-arrays-insert
([key dst x]
(let [x0 ( key (first x))
a (take-while #(< (key (first %)) x0) dst)
b (drop (count a) dst)]
(concat a [x] b))))
(defn sort-arrays-next
([key arrs]
(let [[f & r] arrs]
(if (<= (count f) 1)
r
(sort-arrays-insert key r (rest f))))))
(defn sort-arrays
([key arr0 arr1 & more]
(->> (apply list arr0 arr1 more)
(filter seq)
(sort-by (comp key first))
(iterate #(sort-arrays-next key %))
(take-while seq)
(map ffirst))))
(sort-arrays identity
[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10])
=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
(sort-arrays :time
[{:time 1} {:time 4}]
[{:time 2} {:time 3}]
[{:time 3} {:time 5} {:time 7}]
[{:time 1} {:time 10}])
=> ({:time 1} {:time 1} {:time 2} {:time 3} {:time 3} {:time 4} {:time 5} {:time 7} {:time 10})
Also doing some benchmarking on Rulle, Peter and Leetwinski's answers, I found that Peter's answer is about twice as fast as the rest:
(let [L 20000
N 10]
(mapv (fn [f]
(let [arrs (vec (for [i (range N)]
(vec (range L))))]
(time (doall (f arrs)))))
[merge-sorted ;; Rulle's
merge-sorted-2 ;; Peter's
merge-sorted-3 ;; Leetwinski's
]))
"Elapsed time: 721.649222 msecs" ;; Rulle's
"Elapsed time: 373.058068 msecs" ;; Peter's
"Elapsed time: 754.717533 msecs" ;; Leetwinski's
Upvotes: 2
Reputation: 17859
there is also a nice way to do it by just counting items' frequencies into sorted map, and then unwrapping it with repeat
:
(def data [[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]])
(->> data
(apply concat)
(reduce #(update %1 %2 (fnil inc 0)) (sorted-map))
(mapcat (fn [[k v]] (repeat v k))))
;;=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
also there is a less hacky approach:
(defn min-first-idx [data]
(when-let [items (->> data
(keep-indexed (fn [i x] (when (seq x) [x i])))
seq)]
(second (apply min-key ffirst items))))
(defn min-and-more [data-v]
(when-let [i (min-first-idx data-v)]
[(first (data-v i)) (update data-v i rest)]))
user> (min-and-more [[1 2 3] [0 1 4] [4 5]])
;; [0 [[1 2 3] (1 4) [4 5]]]
so you use it to iteratively take smallest item and rest from collection:
(->> [nil (vec data)]
(iterate (comp min-and-more second))
rest
(take-while (comp seq second))
(map first))
;; (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
Upvotes: 3
Reputation: 4901
Clojure comes with a library of functions producing or operating on lazy sequences, such as map
, iterate
and take-while
. I believe a merge algorithm could be expressed by combining them, something like this.
(defn insert-into-sorted [dst x]
(let [x0 (first x)
a (take-while #(< (first %) x0) dst)
b (drop (count a) dst)]
(vec (concat a [x] b))))
(defn next-arrays [arrs]
(let [[f & r] arrs
restf (rest f)]
(if (empty? restf)
r
(insert-into-sorted r restf))))
(defn merge-sorted-arrays [arrs]
(->> arrs
(filter seq)
(sort-by first)
(iterate next-arrays)
(take-while seq)
(map ffirst)))
And we can call it like this:
(merge-sorted-arrays [[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]])
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
It is true that you could do something like (sort (apply concat ...))
but that could turn out inefficient if you have a lot of data.
Update: A previous version of this code contained a call to count
that limited its applicability to merging sequences of finite length. By changing it to using empty?
instead, there is no such limitation and we can now use it to merge sequences of infinite length:
(take 12 (merge-sorted-arrays [(iterate (partial + 1.1) 1) (iterate (partial + 1.11) 1)]))
;; => (1 1 2.1 2.1100000000000003 3.2 3.2200000000000006 4.300000000000001 4.330000000000001 5.4 5.440000000000001 6.5 6.550000000000002)
Upvotes: 3