zcaudate
zcaudate

Reputation: 14258

fast sorting algorithm for multiple ordered arrays

I have 4 arrays, that are ordered. I'd like to be able to combine them together into a single sorted datastructure and lazily take from that.

Is there an efficient way of doing this?

[1 3 4 6 9 10 15]
[2 3 6 7 8 9 10]
[1 3 6 7 8 9 10]
[1 2 3 4 8 9 10]

=> [1 1 1 2 2 3 3 3 3 4]

Upvotes: 4

Views: 428

Answers (5)

peter pun
peter pun

Reputation: 394

Both Rulle's solution and leetwinski's second one use iterate in quite an anamorphic manner (especially the latter). Let's define unfold using iterate (usually the opposite is done) and write an explicitly anamorphic solution:

(defn unfold [f s]
  (->> s
       (list nil)
       (iterate (comp f second))
       rest
       (take-while some?)
       (map first)))

(defn merge-sorted [s]
  (->> s
       (filter seq)
       (unfold
         (fn [s]
           (if (seq s)
             (loop [[[mf & mn :as m] & s] s, r ()]
               (if-let [[[xf :as x] & s] s]
                 (let [[m x] (if (< xf mf) [x m] [m x])]
                   (recur (cons m s) (cons x r)))
                 (list mf (if mn (cons mn r) r)))))))))

UPDATE

Here is a version of merge-sorted that uses reduce instead of loop and recur:

(defn merge-sorted [s]
  (->> s
       (filter seq)
       (unfold
         (fn [s]
           (if (seq s)
             (let [[[mf & mn] r]
                   (reduce
                     (fn [[m r] x]
                       (if (< (first x) (first m))
                         [x (cons m r)]
                         [m (cons x r)]))
                     [(first s) ()]
                     (rest s))]
               (list mf (if mn (cons mn r) r))))))))

UPDATE'

Impressed by the efficiency of A. Webb's solution after the edit and considering this problem interesting and relatively important, I had a look at the Wikipedia articles on merge and k-way merge algorithms and at this paper. I found out that there is plenty of room for analysis/experimentation/improvement and decided to (re)implement and test several algorithms. Here they are, packed in a map, preceded by some helper functions and followed by some functions useful for testing:

(require ['clojure.core.reducers :as 'reducers])

(defn mapmap [f m]
  (reduce #(update %1 %2 f) m (keys m)))

(defn unfold [f s]
  (->> s
       (list nil)
       (iterate (comp f second))
       rest
       (take-while some?)
       (map first)))

(defn foldh [f s]
  ((fn rec [v]
     (f (if (> (count v) 2)
          (let [h (quot (count v) 2)]
            (map rec [(subvec v 0 h) (subvec v h)]))
          v)))
   (vec s)))

(defn fold2 [f s]
  (loop [s s]
    (if (nnext s)
      (recur (map f (partition-all 2 s)))
      (f s))))

(def merge-sorted
  (merge
    ;direct lazy algorithms
    (mapmap
      (fn [[prepare choose insert]]
        (fn [s]
          (->> s
               (filter seq)
               prepare
               (unfold
                 (fn [s]
                   (if (seq s)
                     (let [[[xf & xn] s] (choose s)]
                       [xf (if xn (insert s xn) s)])))))))
      {:min
       [identity
        (fn [s]
          (reduce
            (fn [[x s] y]
              (if (< (first x) (first y))
                [x (cons y s)]
                [y (cons x s)]))
            [(first s) ()]
            (rest s)))
        conj]
       :sort
       [(partial sort-by first)
        (juxt first rest)
        (fn [s [xf :as x]]
          (let [[a b] (loop [a () b (seq s)]
                        (if-let [[bf & bn] b]
                          (if (< (first bf) xf)
                            (recur (cons bf a) bn)
                            [a b])
                          [a b]))]
            (into (cons x b) a)))]
       :lsort
       [(partial sort-by first)
        (juxt first rest)
        (fn [s [xf :as x]]
          ((fn rec [s]
             (lazy-seq
               (if-let [[sf] (seq s)]
                 (if (< (first sf) xf)
                   (cons sf (rec (rest s)))
                   (cons x s))
                 (list x))))
           s))]
       :heap
       [(fn [s]
          (let [h (java.util.PriorityQueue.
                    (count s)
                    #(< (first %1) (first %2)))]
            (run! #(.add h %) s)
            h))
        (fn [h] [(.poll h) h])
        (fn [h x] (.add h x) h)]})
    ;folding lazy algorithms
    (mapmap
      (letfn [(merge2 [s]
                (lazy-seq
                  (if-let [[x & s] (seq (filter seq s))]
                    (if-let [[y] s]
                      ((fn rec [x y]
                         (lazy-seq
                           (let [[[xf & xn] y]
                                 (if (< (first x) (first y))
                                   [x y]
                                   [y x])]
                             (cons xf (if xn (rec xn y) y)))))
                       x y)
                      x))))]
        (fn [fold] (partial fold merge2)))
      {:foldl #(reduce (comp %1 list) %2)
       :foldh foldh
       :fold2 fold2})
    ;folding eager algorithms
    (mapmap
      (letfn [(merge2 [s]
                (if-let [[x & s] (seq (filter seq s))]
                  (if-let [[y] s]
                    (loop [x x y y acc ()]
                      (let [[[xf & xn] y]
                            (if (< (first x) (first y))
                              [x y]
                              [y x])
                            acc (conj acc xf)]
                        (if xn
                          (recur xn y acc)
                          (into y acc))))
                    x)
                  ()))]
        (fn [fold] (partial fold merge2)))
      {:efoldp #(reducers/fold 2 (comp %1 list) (comp %1 list) (vec %2))
       :efoldh foldh
       :efold2 fold2})))

(defn gen-inp [m n]
  (->> 0
       (repeat m)
       (map
         (comp
           doall
           (partial take n)
           rest
           (partial iterate #(+ % (rand-int 100)))))
       doall))

(defn test-merge-sorted [m n & algs]
   (->> (or algs (sort (keys merge-sorted)))
        (map (juxt name merge-sorted))
        (run!
          (let [inp (gen-inp m n)]
            (fn [[id alg]]
              (println id)
              ;(java.lang.System/gc)
              (try
                (time (doall (alg inp)))
                (catch java.lang.StackOverflowError _
                  (prn "Stack overflow"))))))))

The direct lazy algorithms follow a common scheme parameterized by how the following are done:

  • preprocessing the input
  • computing one chosen sequence and the rest ones
  • inserting the tail of the chosen sequence into the rest ones

:min is like my first solution, which calculates a minimum at each iteration.

:sort is like Rulle's solution, which sorts the sequences initially and makes a sorted-insertion at each iteration.

:lsort is like :sort but with lazy insertion. It can cause stack overflow because of nested lazy sequences.

:heap is a simple but suboptimal implementation of heap-merge using Java's PriorityQueues.

The folding lazy algorithms follow a common scheme parameterized by how a <=2-ary merge is extended to arbitrary arities.

:foldl is like A. Webb's solution before the edit, which does a left folding using reduce. It can cause stack overflow because of nested lazy sequences.

:foldh is an implementation of divide-and-conquer-merge, which folds by splitting in half.

:fold2 is like A. Webb's solution after the edit, which folds by splitting in pairs.

The folding eager algorithms follow a scheme like that of the lazy ones but using an eager <=2-ary merge.

:efoldp is a parallel implementation of divide-and-conquer-merge using clojure.core.reducers/fold, which does various <=2-ary merges concurrently and possibly in parallel, by "forking" every time it splits in half.

:efoldh and :efold2 are like :foldh and :fold2 but eager.

As a short remark on performance I would say that, for fast lazy merging, one of :foldh, :fold2 or maybe :heap should be chosen. As for eager merging, depending on the hardware's parallelism ability and on the input's shape, :efoldp can be faster than the rest. For more information see the linked articles.

Upvotes: 1

A. Webb
A. Webb

Reputation: 26436

You can just write this out explicitly, for a lazy performant version.

(defn merges 
  ([x] x)

  ([x y]
    (cond 
      (empty? x) y
      (empty? y) x
      (< (first x) (first y)) 
        (cons (first x) (lazy-seq (merges y (rest x))))       
     :else 
      (cons (first y) (lazy-seq (merges x (rest y))))))

   ([x y & more]
     (apply merges 
       (for [[a b] (partition-all 2 (list* x y more))]
         (merges a b)))))


(apply merges [[1 3 4 6 9 10 15]
               [2 3 6 7 8 9 10]
               [1 3 6 7 8 9 10]
               [1 2 3 4 8 9 10]])

Edit: This revision merges pairwise up a binary tree for depth log number of sequences, rather than the prior linear reduction.

Upvotes: 1

zcaudate
zcaudate

Reputation: 14258

I slightly modified Rulle's answer to provide sorting on maps:

(defn sort-arrays-insert
  ([key dst x]
   (let [x0 ( key (first x))
         a  (take-while #(< (key (first %)) x0) dst)
         b  (drop (count a) dst)]
     (concat a [x] b))))

(defn sort-arrays-next
  ([key arrs]
   (let [[f & r] arrs]
     (if (<= (count f) 1)
       r
       (sort-arrays-insert key r (rest f))))))

(defn sort-arrays
  ([key arr0 arr1 & more]
   (->> (apply list arr0 arr1 more)
        (filter seq)
        (sort-by (comp key first))
        (iterate #(sort-arrays-next key %))
        (take-while seq)
        (map ffirst))))
(sort-arrays identity
             [1 3 4 6 9 10 15]
             [2 3 6 7 8 9 10]
             [1 3 6 7 8 9 10]
             [1 2 3 4 8 9 10])

=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)
(sort-arrays :time
             [{:time 1} {:time 4}]
             [{:time 2} {:time 3}]
             [{:time 3} {:time 5} {:time 7}]
             [{:time 1} {:time 10}])

=> ({:time 1} {:time 1} {:time 2} {:time 3} {:time 3} {:time 4} {:time 5} {:time 7} {:time 10})

Also doing some benchmarking on Rulle, Peter and Leetwinski's answers, I found that Peter's answer is about twice as fast as the rest:

(let [L 20000
        N 10]
    (mapv (fn [f]
            (let [arrs (vec (for [i (range N)]
                              (vec (range L))))]
              (time (doall (f arrs)))))
          [merge-sorted       ;; Rulle's
           merge-sorted-2     ;; Peter's
           merge-sorted-3     ;; Leetwinski's
]))


"Elapsed time: 721.649222 msecs"  ;; Rulle's
"Elapsed time: 373.058068 msecs"  ;; Peter's
"Elapsed time: 754.717533 msecs"  ;; Leetwinski's

Upvotes: 2

leetwinski
leetwinski

Reputation: 17859

there is also a nice way to do it by just counting items' frequencies into sorted map, and then unwrapping it with repeat:

(def data [[1 3 4 6 9 10 15]
           [2 3 6 7 8 9 10]
           [1 3 6 7 8 9 10]
           [1 2 3 4 8 9 10]])

(->> data
     (apply concat)
     (reduce #(update %1 %2 (fnil inc 0)) (sorted-map))
     (mapcat (fn [[k v]] (repeat v k))))

;;=> (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

also there is a less hacky approach:

(defn min-first-idx [data]
  (when-let [items (->> data
                        (keep-indexed (fn [i x] (when (seq x) [x i])))
                        seq)]    
    (second (apply min-key ffirst items))))

(defn min-and-more [data-v]
  (when-let [i (min-first-idx data-v)]
    [(first (data-v i)) (update data-v i rest)]))

user> (min-and-more [[1 2 3] [0 1 4] [4 5]])
;; [0 [[1 2 3] (1 4) [4 5]]]

so you use it to iteratively take smallest item and rest from collection:

(->> [nil (vec data)]      
     (iterate (comp min-and-more second))
     rest
     (take-while (comp seq second))
     (map first))

;; (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

Upvotes: 3

Rulle
Rulle

Reputation: 4901

Clojure comes with a library of functions producing or operating on lazy sequences, such as map, iterate and take-while. I believe a merge algorithm could be expressed by combining them, something like this.

(defn insert-into-sorted [dst x]
  (let [x0 (first x)
        a (take-while #(< (first %) x0) dst)
        b (drop (count a) dst)]
    (vec (concat a [x] b))))

(defn next-arrays [arrs]
  (let [[f & r] arrs
        restf (rest f)]
    (if (empty? restf)
      r
      (insert-into-sorted r restf))))

(defn merge-sorted-arrays [arrs]
  (->> arrs
       (filter seq)
       (sort-by first)
       (iterate next-arrays)
       (take-while seq)
       (map ffirst)))

And we can call it like this:

(merge-sorted-arrays [[1 3 4 6 9 10 15]
                      [2 3 6 7 8 9 10]
                      [1 3 6 7 8 9 10]
                      [1 2 3 4 8 9 10]])
;; => (1 1 1 2 2 3 3 3 3 4 4 6 6 6 7 7 8 8 8 9 9 9 9 10 10 10 10 15)

It is true that you could do something like (sort (apply concat ...)) but that could turn out inefficient if you have a lot of data.

Update: A previous version of this code contained a call to count that limited its applicability to merging sequences of finite length. By changing it to using empty? instead, there is no such limitation and we can now use it to merge sequences of infinite length:

(take 12 (merge-sorted-arrays [(iterate (partial + 1.1) 1) (iterate (partial + 1.11) 1)]))
;; => (1 1 2.1 2.1100000000000003 3.2 3.2200000000000006 4.300000000000001 4.330000000000001 5.4 5.440000000000001 6.5 6.550000000000002)

Upvotes: 3

Related Questions