Drew Noakes
Drew Noakes

Reputation: 310977

Full outer join two sequences of maps in Clojure

Given:

(def seq1 ({:id 1 :val 10} {:id 2 :val 20}))
(def seq2 ({:id 1 :val 12} {:id 3 :val 30}))

Within each sequence, the :id value is guaranteed to be unique in that sequence, though not necessarily ordered.

How can these two sequences of maps be joined by the :id key, so that the result is as shown?

{1 [{:id 1 :val 10} {:id 1 :val 12}],
 2 [{:id 2 :val 20} nil            ],
 3 [nil             {:id 3 :val 30}]}

The ultimate result is a map of pairs. This is similar to a full outer join, where not only the intersection, but also the difference of the two sets is included in the output.

Upvotes: 1

Views: 773

Answers (4)

Michał Marczyk
Michał Marczyk

Reputation: 84341

Here's another version, which in my benchmarks using input sizes of 2+2, 90+90, 900+900, 90000+99000 and 300000+300000 is the fastest so far.

(defn outer-join [k xs ys]
  (let [gxs (group-by #(get % k) xs)
        gys (group-by #(get % k) ys)
        kvs (concat (keys gxs) (keys gys))]
    (persistent!
     (reduce (fn [out k]
               (let [l (first (get gxs k))
                     r (first (get gys k))]
                 (assoc! out k [l r])))
             (transient {})
             kvs))))

(I experimented with wrapping the key seq in distinct, but it turned out to result in a slowdown in benchmarks involving small-to-moderately-large inputs. This makes sense: we need to walk both key seqs anyway and the amount of work we do per key is so tiny that it may well be more work to avoid it.)

Here is a sanity check and a handful of Criterium benchmarks (with amalloy's version renamed to outer-join*):

(let [xs [{:id 1 :val 10} {:id 2 :val 20}]
      ys [{:id 1 :val 12} {:id 3 :val 30}]]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
WARNING: Final GC required 3.296446000194027 % of runtime
Evaluation count : 17099160 in 60 samples of 284986 calls.
             Execution time mean : 3.589256 µs
    Execution time std-deviation : 34.976485 ns
   Execution time lower quantile : 3.544196 µs ( 2.5%)
   Execution time upper quantile : 3.666515 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 6596160 in 60 samples of 109936 calls.
             Execution time mean : 9.107578 µs
    Execution time std-deviation : 82.176826 ns
   Execution time lower quantile : 8.993900 µs ( 2.5%)
   Execution time upper quantile : 9.295188 µs (97.5%)
                   Overhead used : 2.295807 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 9298740 in 60 samples of 154979 calls.
             Execution time mean : 6.592289 µs
    Execution time std-deviation : 63.929382 ns
   Execution time lower quantile : 6.506403 µs ( 2.5%)
   Execution time upper quantile : 6.749262 µs (97.5%)
                   Overhead used : 2.295807 ns

Found 4 outliers in 60 samples (6.6667 %)
    low-severe   4 (6.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 90))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 10 100))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
Evaluation count : 413760 in 60 samples of 6896 calls.
             Execution time mean : 147.182107 µs
    Execution time std-deviation : 1.282179 µs
   Execution time lower quantile : 145.103445 µs ( 2.5%)
   Execution time upper quantile : 149.658348 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 256920 in 60 samples of 4282 calls.
             Execution time mean : 238.166905 µs
    Execution time std-deviation : 1.987980 µs
   Execution time lower quantile : 235.211277 µs ( 2.5%)
   Execution time upper quantile : 242.255072 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 362760 in 60 samples of 6046 calls.
             Execution time mean : 167.301109 µs
    Execution time std-deviation : 1.616075 µs
   Execution time lower quantile : 164.534670 µs ( 2.5%)
   Execution time upper quantile : 170.757257 µs (97.5%)
                   Overhead used : 2.295807 ns

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 900))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 100 1000))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
Evaluation count : 33840 in 60 samples of 564 calls.
             Execution time mean : 1.754723 ms
    Execution time std-deviation : 29.229644 µs
   Execution time lower quantile : 1.709219 ms ( 2.5%)
   Execution time upper quantile : 1.805009 ms (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 22740 in 60 samples of 379 calls.
             Execution time mean : 2.559172 ms
    Execution time std-deviation : 44.520222 µs
   Execution time lower quantile : 2.490201 ms ( 2.5%)
   Execution time upper quantile : 2.657706 ms (97.5%)
                   Overhead used : 2.295807 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 6.2842 % Variance is slightly inflated by outliers
Evaluation count : 30000 in 60 samples of 500 calls.
             Execution time mean : 1.999194 ms
    Execution time std-deviation : 25.723647 µs
   Execution time lower quantile : 1.962350 ms ( 2.5%)
   Execution time upper quantile : 2.045836 ms (97.5%)
                   Overhead used : 2.295807 ns

Huge inputs (excluding outer-join-maps):

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 300000))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 100000 400000))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys)))
WARNING: Final GC required 13.371566110062922 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 772.532296 ms
    Execution time std-deviation : 12.710681 ms
   Execution time lower quantile : 744.832577 ms ( 2.5%)
   Execution time upper quantile : 801.098417 ms (97.5%)
                   Overhead used : 2.295807 ns

Found 6 outliers in 60 samples (10.0000 %)
    low-severe   2 (3.3333 %)
    low-mild     3 (5.0000 %)
    high-mild    1 (1.6667 %)
 Variance from outliers : 5.3156 % Variance is slightly inflated by outliers
WARNING: Final GC required 15.51698960336361 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 949.508151 ms
    Execution time std-deviation : 32.952708 ms
   Execution time lower quantile : 911.054447 ms ( 2.5%)
   Execution time upper quantile : 1.031623 sec (97.5%)
                   Overhead used : 2.295807 ns

Found 4 outliers in 60 samples (6.6667 %)
    low-severe   4 (6.6667 %)
 Variance from outliers : 20.6517 % Variance is moderately inflated by outliers

Upvotes: 2

Arthur Ulfeldt
Arthur Ulfeldt

Reputation: 91554

If you don't require the nil's for each map which does not have a given key then merge-with can handle this problem fairly easily.

user> (def seq1 [{:id 1 :val 10} {:id 2 :val 20}])
#'user/seq1 

user> (def seq2 [{:id 1 :val 12} {:id 3 :val 30}]) 
#'user/seq2 

user> (def data (concat seq1 seq2)) 
#'user/data

user> (reduce (partial merge-with (comp vec concat)) 
              (map #(hash-map (:id %) [%]) data))

{1 [{:val 10, :id 1} {:val 12, :id 1}], 
 2 [{:val 20, :id 2}], 
 3 [{:val 30, :id 3}]}

Upvotes: 1

amalloy
amalloy

Reputation: 91907

Your answer is as good as anything else, really, but I would write it as

(defn outer-join [field a b]
  (let [lookup #(get % field)
        indexed (for [coll [a b]]
                  (into {} (map (juxt lookup identity) coll)))]
    (into {} (for [key (distinct (mapcat keys indexed))]
               [key (map #(get % key) indexed)]))))

Upvotes: 2

Drew Noakes
Drew Noakes

Reputation: 310977

Here's the answer I came up with, however I'm sure it can be made more elegant or potentially to have better performance.

(defn seq-to-map [seq key]
  (into {} (map (fn [{id key :as m}] [id m]) seq)))

(defn outer-join-maps [seq1 seq2 key]
  (let [map1 (seq-to-map seq1 key)
        map2 (seq-to-map seq2 key)
        allkeys (set (clojure.set/union (keys map1) (keys map2)))]
    (into {} (map (fn [k] [k [(get map1 k) (get map2 k)]]) allkeys))))

The following tests pass:

(fact {:midje/description "Sequence to map"}

      (seq-to-map [{:a 1, :b 1} {:a 2, :b 2}] :a)
        => {1 {:a 1, :b 1}, 2 {:a 2, :b 2}}

      (seq-to-map [{:a 1, :b 1} {:a 1, :b 2}] :a)
        => {1 {:a 1, :b 2}} ; takes last value when a key is repeated

      (seq-to-map [] :a)
        => {})

(fact {:midje/description "Sequence merging"}
      (let [seq1 [{:id 1 :val 10} {:id 2 :val 20}]
            seq2 [{:id 1 :val 12} {:id 3 :val 30}]]

        (outer-join-maps seq1 seq2 :id)
          => {1 [{:id 1 :val 10} {:id 1 :val 12}],
              2 [{:id 2 :val 20} nil],
              3 [nil {:id 3 :val 30}]}))

Upvotes: 3

Related Questions