Piotr Lopusiewicz
Piotr Lopusiewicz

Reputation: 2594

Finding unique (as in only occurring once) element haskell

I need a function which takes a list and return unique element if it exists or [] if it doesn't. If many unique elements exists it should return the first one (without wasting time to find others). Additionally I know that all elements in the list come from (small and known) set A. For example this function does the job for Ints:

unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
    where first [] = []
          first (x:xs) = x

ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
ghci> [1]

This is however not good enough because it involves sorting (n log n) while it could be done in linear time (because A is small). Additionally it requires the type of list elements to be Ord while all which should be needed is Eq. It would also be nice if amount of comparisons was as small as possible (ie if we traverse a list and encounter element el twice we don't test subsequent elements for equality with el)

This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.

The question is: how to do it correctly and efficiently in Haskell ?

Upvotes: 8

Views: 3086

Answers (6)

luqui
luqui

Reputation: 60533

Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.

import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative

Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).

counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
    where
    insert t x = IntTrie.modify' (fromEnum x) (+1) t

We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).

And then look for ones that are unique.

uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
    where
    cts = counts xs

This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.

uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).

Upvotes: 12

Petr
Petr

Reputation: 63409

As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.

If we are able to compare elements, an obvious O(n log n) solution to compute the count of elements first and then find the first one with count equal to 1:

import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)

count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m

add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m

uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
  where
    cs = foldl' add Map.empty xs

Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k unique elements then the size of our map will be at most k, so the overall complexity will be just O(n log k).

However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:

import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)

count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)

add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)

uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
    -- Count all elements into a hash table:
    ht <- HT.newSized (length xs)
    forM_ xs (add ht)
    -- Find the first one with count 1
    first (\x -> liftM (== 1) (count ht x)) xs


-- Monadic variant of find which exists once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
  where
    f []        = return Nothing
    f (x:xs')   = do
        b <- p x
        if b then return (Just x)
             else f xs'

Notes:

  • If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs but in the case of many distinct elements the hash table will be have to resized several times.

Upvotes: 2

Stefan Holdermans
Stefan Holdermans

Reputation: 8050

Here is a version that does the trick:

unique :: Eq a => [a] -> [a]
unique =  select . collect []
  where
    collect acc []              = acc
    collect acc (x : xs)        = collect (insert x acc) xs

    insert x []                 = [[x]]
    insert x (ys@(y : _) : yss) 
      | x == y                  = (x : ys) : yss
      | otherwise               = ys : insert x yss

    select []                   = []
    select ([x] : _)            = [x]
    select ((_ : _) : xss)      = select xss

So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).

The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.

Upvotes: 1

zurgl
zurgl

Reputation: 1930

Something like this look pretty good.

unique = fst . foldl' (\(a, b) c -> if (c `elem` b) 
                                    then (a, b) 
                                    else if (c `elem` a) 
                                         then (delete c a, c:b) 
                                         else (c:a, b)) ([],[]) 

The first element of the resulted tuple of the fold, contain what you are expecting, a list containing unique element. The second element of the tuple is the memory of the process remembered if an element has already been discarded or not.

About space performance.
As your problem is design, all the element of the list should be traversed at least one time, before a result can be display. And the internal algorithm must keep trace of discarded value in addition to the good one, but discarded value will appears only one time. Then in the worst case the required amount of memory is equal to the size of the inputted list. This sound goods as you said that expected input are small.

About time performance.
As the expected input are small and not sorted by default, trying to sort the list into the algorithm is useless, or before to apply it is useless. In fact statically we can almost said, that the extra operation to place an element at its ordered place (into the sub list a and b of the tuple (a,b)) will cost the same amount of time than to check if this element appear into the list or not.


Below a nicer and more explicit version of the foldl' one.

import Data.List (foldl', delete, elem)

unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], []) 
  where 
    algorithm (result0, memory0) current = 
         if (current `elem` memory0)
         then (result0, memory0)
         else if (current`elem` result0)
              then (delete current result0, memory) 
              else (result, memory0) 
            where
                result = current : result0
                memory = current : memory0

Into the nested if ... then ... else ... instruction the list result is traversed twice in the worst case, this can be avoid using the following helper function.

unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], []) 
  where 
    algorithm (result, memory) current = 
         if (current `elem` memory)
         then (result, memory)
         else helper current result memory []
            where
               helper current [] [] acc = ([current], [])
               helper current [] memory acc = (acc, memory)
               helper current (r:rs) memory acc 
                   | current == r    = (acc ++ rs, current:memory) 
                   | otherwise = helper current rs memory (r:acc)

But the helper can be rewrite using fold as follow, which is definitely nicer.

helper current [] _ = ([current],[])
helper current memory result = 
    foldl' (\(r, m) x -> if x==current 
                         then (r, current:m) 
                         else (current:r, m)) ([], memory) $ result

Upvotes: 0

comingstorm
comingstorm

Reputation: 26117

First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.

Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.

If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:

firstUnique (x:xs)
  | elem x xs = firstUnique (filter (/= x) xs)
  | otherwise = Just x
firstUnique [] = Nothing

Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.


Edit:

The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...

However, an implementation that provides an early out in case of set exhaustion:

firstUnique = f [] [<small/known set of possible elements>] where
  f [] [] _ = Nothing  -- early out
  f uniques noshows (x:xs)
    | elem x uniques = f (delete x uniques) noshows xs
    | elem x noshows = f (x:uniques) (delete x noshows) xs
    | otherwise      = f uniques noshows xs
  f []    _ [] = Nothing
  f (u:_) _ [] = Just u

Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...

Upvotes: 2

C. A. McCann
C. A. McCann

Reputation: 77414

There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.

Also, note that to avoid useless comparisons you'd need a way of checking to see if an element has been encountered before, and the only way to do that would be to have a list of elements known to have multiple occurrences, and the only way to check if the current element is in that list is... to compare it for equality with each.

If you want this to work faster than O(something really horrible) you need that Ord constraint.


Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:

unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
    | x `notElem` candidates = unique candidates results xs
    | x `elem` results       = unique (delete x candidates) (delete x results) xs
    | otherwise              = unique candidates (x:results) xs

The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.

If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.

Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.

Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.

Upvotes: 6

Related Questions