Eddie

Reputation: 891

Determine presence of duplicates in a list of any data type in Haskell

Is there a function in Haskell that will take in a list of any data type (specifically a String) and check whether there are repeated elements in the list, i.e. repeated letters in the String?

Upvotes: 1

Views: 1943

Answers (4)

Satvik

Reputation: 11208

That's not too difficult to write yourself. A naive solution:

import Data.List

isRepeat :: Ord a => [a] -> Bool
isRepeat = any ((>1) . length) . group . sort
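
For example:

ghci> isRepeat "hello"
True
ghci> isRepeat "helo"
False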

It's possible to write a solution using just Eq (using nub, as suggested by @luqui), but that would be at least O(n^2) (I may be wrong). Using an Ord constraint gives you the flexibility of doing it with lower complexity.
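
For reference, here is a minimal Eq-only sketch (isRepeatEq is just an illustrative name; this is essentially the nub approach @luqui suggests):

isRepeatEq :: Eq a => [a] -> Bool
isRepeatEq xs = nub xs /= xs  -- checks every element against every earlier one, hence O(n^2)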

As pointed out by @luqui, the sort-based solution above is not lazy enough to work on infinite lists. It is nevertheless one of the fastest of the solutions he compares. It seems that using the Set version gives you the best of both worlds.
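
The timings that follow were produced with criterion; a harness along these lines matches the benchmark names in the output, with the hasDup definitions taken from the answers on this page (testList is an assumed input, since the list used in the original run isn't shown):

import Criterion.Main
import Data.List (group, nub, sort)
import qualified Data.Set as Set

-- definitions from the answers on this page
hasDup_sort :: Ord a => [a] -> Bool
hasDup_sort = any ((> 1) . length) . group . sort

hasDup_nub :: Eq a => [a] -> Bool
hasDup_nub xs = nub xs /= xs

hasDup_any :: Eq a => [a] -> Bool
hasDup_any []     = False
hasDup_any (x:xs) = any (x ==) xs || hasDup_any xs

hasDup_set :: Ord a => [a] -> Bool
hasDup_set = go Set.empty
  where
    go _    []     = False
    go seen (x:xs)
        | x `Set.member` seen = True
        | otherwise           = go (Set.insert x seen) xs

-- hypothetical input; the list used in the original benchmark is not shown
testList :: [Int]
testList = [1..10000]

main :: IO ()
main = defaultMain
    [ bgroup "Repeats in a list"
        [ bench "Repeat with Sort" $ whnf hasDup_sort testList
        , bench "Repeat with set"  $ whnf hasDup_set  testList
        , bench "Repeat with nub"  $ whnf hasDup_nub  testList
        , bench "Repeat with any"  $ whnf hasDup_any  testList
        ]
    ]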

Here are the benchmark results.

warming up
estimating clock resolution...
mean is 1.639027 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
  1988 (0.6%) high severe
estimating cost of a clock call...
mean is 43.37523 ns (12 iterations)
found 2 outliers among 12 samples (16.7%)
  1 (8.3%) high mild
  1 (8.3%) high severe

benchmarking Repeats in a list/Repeat with Sort
mean: 546.5002 us, lb 543.3960 us, ub 552.4869 us, ci 0.950
std dev: 21.33737 us, lb 12.23079 us, ub 35.02675 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
  6 (6.0%) high mild
  11 (11.0%) high severe
variance introduced by outliers: 35.597%
variance is moderately inflated by outliers

benchmarking Repeats in a list/Repeat with set
mean: 585.5344 us, lb 582.7982 us, ub 589.2482 us, ci 0.950
std dev: 16.17934 us, lb 12.98695 us, ub 20.36634 us, ci 0.950
found 14 outliers among 100 samples (14.0%)
  10 (10.0%) high mild
  4 (4.0%) high severe
variance introduced by outliers: 21.917%
variance is moderately inflated by outliers

benchmarking Repeats in a list/Repeat with nub
mean: 10.24364 ms, lb 10.12385 ms, ub 10.40203 ms, ci 0.950
std dev: 700.2262 us, lb 561.3143 us, ub 851.5999 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
  2 (2.0%) high mild
  15 (15.0%) high severe
variance introduced by outliers: 63.587%
variance is severely inflated by outliers

benchmarking Repeats in a list/Repeat with any
mean: 6.796171 ms, lb 6.712188 ms, ub 6.941347 ms, ci 0.950
std dev: 552.0155 us, lb 380.1686 us, ub 859.1697 us, ci 0.950
found 13 outliers among 100 samples (13.0%)
  4 (4.0%) high mild
  9 (9.0%) high severe
variance introduced by outliers: 71.738%
variance is severely inflated by outliers

[Benchmark density chart; the x-axis is in ms.]

Upvotes: 2

luqui

Reputation: 60463

Here is a more advanced treatment. None of the answers so far (including my other one) behaves optimally with respect to laziness. Here they are:

import Data.List (nub, group, sort)

hasDup_nub, hasDup_any :: (Eq a)  => [a] -> Bool
hasDup_sort            :: (Ord a) => [a] -> Bool

hasDup_nub xs = nub xs /= xs

hasDup_any [] = False
hasDup_any (x:xs) = any (x ==) xs || hasDup_any xs

hasDup_sort = any ((> 1) . length) . group . sort

A good heuristic to check whether something behaves well lazily -- that is, whether it will give an answer while evaluating as little as possible -- is to see how well it works on infinite lists. We can't expect any hasDup to give an answer for hasDup [1,2..]: no duplicate ever occurs, but since we never evaluate the whole list, it's impossible to tell whether one will turn up in the future. However, hasDup [1,1..] should certainly return True -- the first two elements are even the same. Let's see how they fare:

>>> hasDup_nub [1,1..]
^CInterrupted.
>>> hasDup_any [1,1..]
True
>>> hasDup_sort [1,1..]
^CInterrupted.

nub [1,1..] is [1 (if you will allow me to write it that way -- that is, it's a one followed by an infinite loop if you try to evaluate any more elements), so when checking whether the lists are equal, we check whether the second element is 1, and we get into an infinite loop. sort [1,1..] is just plain old bottom (another way of saying infinite loop), so we never get any information out of that. hasDup_any does okay because it checks the first element against the rest of the list. But it is easily foiled:

>>> hasDup_any (1:[2,2..])
^CInterrupted.

We get stuck trying to evaluate any (1 ==) [2,2..], because we never find a 1, but we don't know if we just have to keep looking farther.

All three of these do perform differently in the case of laziness. Here's one where hasDup_nub returns when the others don't:

>>> hasDup_nub (1:2:2:[3,3..])
True

That's because nub (1:2:2:[3,3..]) = [1,2,3 (in my infinite loop notation), which is far enough to tell that it's not equal to [1,2,2,3,3,3..] because they differ in the third element.
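
You can observe this partial value by forcing only the first three elements:

>>> take 3 (nub (1:2:2:[3,3..]))
[1,2,3]
>>> take 4 (nub (1:2:2:[3,3..]))
^CInterrupted.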

hasDup_sort is the strictest -- when the other two don't return an answer, neither will it.

However, @Satvik's answer points out that hasDup_sort has better complexity than the other two: its asymptotic complexity is O(n log n), whereas the other two are O(n^2).

It turns out that there is a pretty easy solution with hasDup_sort's asymptotic complexity and better strictness properties than any of the other solutions. We just chomp along the list, recording what we've seen so far, and return True as soon as we come across an element we've already seen. If we use a Set from Data.Set to keep track of what we've seen, then insertion and membership testing are each O(log n), and we reach an O(n log n) solution.

import qualified Data.Set as Set

hasDup_set :: (Ord a) => [a] -> Bool
hasDup_set = go Set.empty
    where
    go seen [] = False
    go seen (x:xs)
        | x `Set.member` seen = True
        | otherwise           = go (Set.insert x seen) xs

It has no trouble with any of the infinite test cases so far:

>>> hasDup_set [1,1..]
True
>>> hasDup_set (1:[2,2..])
True
>>> hasDup_set (1:2:2:[3,3..])
True

Of course, if you give it an infinite list with no duplicates, it still can't tell. No computable function can (while still respecting referential transparency):

>>> hasDup_set [1,2..]
^CInterrupted.

I believe hasDup_set is as good as you can do in terms of both asymptotic complexity and laziness* (but I'm definitely open to challenges). It's fortunate that it's possible to get them both at the same time -- sometimes there is an essential trade-off you have to make.

* Without using unamb, which can often achieve even greater levels of laziness. I have not discussed unamb because it is very hard for me, and I haven't (nor has anyone, to my knowledge) come up with a good methodology for using it effectively. See here for an attempt, and an example of how complex and subtle it can get.

Upvotes: 4

The Internet

Reputation: 8103

This solution will short-circuit when a duplicate is found. I've included the definitions of any and (||) from the Prelude for completeness.

import           Prelude hiding (any, (||))

x,y :: [Char]
x = "hello"
y = "helo"

main :: IO ()
main = mapM_ print [hasDupe x, hasDupe y]

hasDupe :: Eq a => [a] -> Bool
hasDupe [] = False
hasDupe (x:xs) = any (==x) xs || hasDupe xs

any :: (a -> Bool) -> [a] -> Bool
any f [] = False
any f (x:xs) = f x || any f xs

(||)                    :: Bool -> Bool -> Bool
True  || _              =  True
False || x              =  x
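
Note that (||) as defined here never evaluates its second argument when the first is True, which is exactly what makes the short-circuiting work. Because of that, hasDupe can even answer on some infinite inputs, provided the duplicate shows up early:

>>> hasDupe (1:1:[2..])
True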

Upvotes: 1

luqui

Reputation: 60463

The nub function from the Data.List module

nub :: (Eq a) => [a] -> [a]

removes duplicates that have already been seen in a list, and preserves the order otherwise.

ghci> nub [1,3,5,3,6,5]
[1,3,5,6]

You can use this function to write the function you are looking for as a simple one-liner. Because of laziness, it's also as efficient as you can get when only using an Eq constraint! (You can do better if you require Ord, though)
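
One way to spell that one-liner (it is the same hasDup_nub that appears in the more advanced answer above):

hasDup :: Eq a => [a] -> Bool
hasDup xs = nub xs /= xs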

Upvotes: 8
