mikkom
mikkom

Reputation: 3630

clojure: remove case-insensitive string duplicates

What is the idiomatic clojure way to remove strings from an array of strings if there is case-insensitive match?

I need to preserve the case for the results (I always want to preserve the first occurence of insensitive match).

Simple example:

(distinct-case-insensitive ["fish" "Dog" "cat"] ["FISH "DOG"])

would return

["fish" "Dog" "cat"]

Upvotes: 1

Views: 1601

Answers (4)

Xavi
Xavi

Reputation: 603

A solution is to implement a distinct-by that allows to specify a function to apply on each element before checking the duplicates

(defn distinct-by [f coll]
  (let [groups (group-by f coll)]
    (map #(first (groups %)) (distinct (map f coll)))))

For the example case this can be used like

(distinct-by clojure.string/lower-case
             (concat ["fish" "Dog" "cat"] ["FISH" "DOG"]))
; => ("fish" "Dog" "cat") 

Upvotes: 0

jbm
jbm

Reputation: 2600

Here's a solution that meets your requirements (the first matching item "wins" and order is preserved), is lazy, and has the benefit of being a higher order function. It takes a keyfn as its first argument, in correspondence with e.g. sort-by and group-by.

(defn distinct-by [keyfn coll]
  (letfn [(step [xs seen]
            (lazy-seq
             ((fn [xs]
                (when-let [[x & more] (seq xs)]
                  (let [k (keyfn x)]
                    (if (seen k)
                      (recur more)
                      (cons x (step more (conj seen k)))))))
              xs)))]
    (step coll #{})))

So your usage would be:

(require '[clojure.string :as str])
(distinct-by str/lower-case ["fish" "Dog" "cat" "Fish" "DOG"])
;=> ("fish" "Dog" "cat")

The use of recur and the inner anonymous function is a relatively minor optimization. clojure.core/distinct uses it but in many cases it's not necessary. Here's a version without the extra noise:

(defn distinct-by [keyfn coll]
  (letfn [(step [xs seen]
            (lazy-seq
             (when-let [[x & more] (seq xs)]
               (let [k (keyfn x)]
                 (if (seen k)
                   (step more seen)
                   (cons x (step more (conj seen k))))))))]
    (step coll #{})))

Upvotes: 1

Leonid Beschastny
Leonid Beschastny

Reputation: 51480

Greedy solution

Obviously, you can't use build-in distinct here, so you should reimplement it yourself.

mishadoff's solution is really beautiful and clujur'ish, but it breaks the order of elements when there are more then 8 unique elements dye to clojure HashMap implementation.

The safest way to do what you want is to use reduce:

(defn concat-distinct [& colls]
  (first
    (reduce (fn [[coll seen] el]
              (let [lc-el (string/lower-case el)]
                (if (contains? seen lc-el)
                    [coll seen]
                    [(conj coll el) (conj seen lc-el)])))
            [[] #{}]
            (apply concat colls))))

If works for any number of collections:

user=> (concat-distinct ["fish" "Dog" "cat"] ["FISH" "DOG"] ["snake"] ["CaT" "Camel"])
["fish" "Dog" "cat" "snake" "Camel"]

And for any number of distinct elements (unlike mishadoff's solution):

user=> (concat-distinct ["g" "h" "i" "j" "a" "b" "c" "d" "e" "f"])
["g" "h" "i" "j" "a" "b" "c" "d" "e" "f"]

Lazy solution

In most cases you'll be fine with greedy solution. But if you want it to be lazy, then you won't be able to avoid recursion:

(defn lazy-concat-distinct [& colls]
  ((fn step [coll seen]
      (lazy-seq
        (loop [[el & xs :as s] coll]
          (when (seq s)
            (let [lc-el (string/lower-case el)]
              (if (contains? seen lc-el)
                  (recur xs)
                  (cons el (step xs (conj seen lc-el)))))))))
    (apply concat colls) #{}))

This solution uses lazy sequences:

user=> (def res (lazy-concat-distinct (lazy-seq (println :boo) ["boo"])))
user=> (count res)
:boo
1

You can make it even lazier using lazy-cat macro:

(defmacro lazy-concat-distinct* [& colls]
  `(lazy-concat-distinct (lazy-cat ~@colls)))

Now it won't even evaluate it's arguments until they are actually used:

user=> (def res (lazy-concat-distinct* (do (println :boo) ["boo"])))
user=> (count res)
:boo
1

It's useful when you want to aggregate data from some large database without downloading it all at once.

N.B. Be careful with lazy solutions. For example, this solution works almost 4 times slower than the greedy one.

Upvotes: 3

mishadoff
mishadoff

Reputation: 10789

This is solution I came up with. To simplify function it accepts just one list with duplicates, so if you need vararg lists (apply concat lists) before.

(defn distinct-case-insensitive [xs]
  (->> xs
       (group-by clojure.string/lower-case)
       (vals)
       (map first)))

(distinct-case-insensitive ["fish" "Dog" "cat" "Fish" "DOG"]) => 
("fish" "Dog" "cat")

But, as Leonid mentioned it does not preserve order due to hashmap. For ordered solution use

(defn distinct-case-insesitive [xs]
    (->> xs
         (group-by clojure.string/lower-case)
         (#(map % (map clojure.string/lower-case xs)))
         (map first)
         (distinct)))

Upvotes: 5

Related Questions