Reputation: 891
Is there a function in Haskell that will take in a list of any data type (specifically a `String`) and check whether there are repeated elements in the list, i.e. repeated letters in the `String`?
Upvotes: 1
Views: 1943
Reputation: 11208
That's not too difficult to write yourself. A naive solution:
import Data.List
isRepeat :: Ord a => [a] -> Bool
isRepeat = any ((>1) . length) . group . sort
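For example, on two short strings:

>>> isRepeat "hello"
True
>>> isRepeat "helo"
False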
It's possible to write a solution using just `Eq` (using `nub` as suggested by @luqui), but that would be at least O(n^2) (I may be wrong). Using an `Ord` constraint gives you the flexibility of doing it in lower complexity.
As pointed out by @luqui, the above solution is not lazy enough to work on infinite lists, though it is one of the fastest among the solutions he compares. Using the Set version seems to give you the best of both worlds.
Here are the benchmark results.
warming up
estimating clock resolution...
mean is 1.639027 us (320001 iterations)
found 2846 outliers among 319999 samples (0.9%)
1988 (0.6%) high severe
estimating cost of a clock call...
mean is 43.37523 ns (12 iterations)
found 2 outliers among 12 samples (16.7%)
1 (8.3%) high mild
1 (8.3%) high severe
benchmarking Repeats in a list/Repeat with Sort
mean: 546.5002 us, lb 543.3960 us, ub 552.4869 us, ci 0.950
std dev: 21.33737 us, lb 12.23079 us, ub 35.02675 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
6 (6.0%) high mild
11 (11.0%) high severe
variance introduced by outliers: 35.597%
variance is moderately inflated by outliers
benchmarking Repeats in a list/Repeat with set
mean: 585.5344 us, lb 582.7982 us, ub 589.2482 us, ci 0.950
std dev: 16.17934 us, lb 12.98695 us, ub 20.36634 us, ci 0.950
found 14 outliers among 100 samples (14.0%)
10 (10.0%) high mild
4 (4.0%) high severe
variance introduced by outliers: 21.917%
variance is moderately inflated by outliers
benchmarking Repeats in a list/Repeat with nub
mean: 10.24364 ms, lb 10.12385 ms, ub 10.40203 ms, ci 0.950
std dev: 700.2262 us, lb 561.3143 us, ub 851.5999 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
2 (2.0%) high mild
15 (15.0%) high severe
variance introduced by outliers: 63.587%
variance is severely inflated by outliers
benchmarking Repeats in a list/Repeat with any
mean: 6.796171 ms, lb 6.712188 ms, ub 6.941347 ms, ci 0.950
std dev: 552.0155 us, lb 380.1686 us, ub 859.1697 us, ci 0.950
found 13 outliers among 100 samples (13.0%)
4 (4.0%) high mild
9 (9.0%) high severe
variance introduced by outliers: 71.738%
variance is severely inflated by outliers
[Bar chart of the benchmark results above; the x-axis is in ms.]
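For reference, output in this format comes from a criterion harness along the lines of the sketch below. The actual benchmarking code and test data are not shown here, so `testList` is my own assumption; the `hasDup_*` functions are the ones defined in @luqui's answer below.

import Criterion.Main

-- Hypothetical test input: a long list whose only duplicate is at the end.
-- The data actually used for the numbers above is not shown.
testList :: [Int]
testList = [1..10000] ++ [1]

main :: IO ()
main = defaultMain
  [ bgroup "Repeats in a list"
      [ bench "Repeat with Sort" $ whnf hasDup_sort testList
      , bench "Repeat with set"  $ whnf hasDup_set  testList
      , bench "Repeat with nub"  $ whnf hasDup_nub  testList
      , bench "Repeat with any"  $ whnf hasDup_any  testList
      ]
  ]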
Upvotes: 2
Reputation: 60463
Here is a more advanced treatment. None of the answers so far (including my other one) performs optimally when laziness is taken into account. Here they are:
import Data.List (nub, group, sort)
hasDup_nub, hasDup_any :: (Eq a) => [a] -> Bool
hasDup_sort :: (Ord a) => [a] -> Bool
hasDup_nub xs = nub xs /= xs
hasDup_any [] = False
hasDup_any (x:xs) = any (x ==) xs || hasDup_any xs
hasDup_sort = any ((> 1) . length) . group . sort
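On finite inputs all three agree; for example:

>>> [f "hello" | f <- [hasDup_nub, hasDup_any, hasDup_sort]]
[True,True,True]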
A good heuristic to check whether something behaves well lazily -- that is, whether it will give an answer while evaluating as little as possible -- is to see how well it works on infinite lists. We can't expect any `hasDup` to give an answer on `hasDup [1,2..]`, because no duplicate ever occurs, but we never evaluate the whole list, so it's impossible to tell whether there will be one in the future. However, `hasDup [1,1..]` should certainly return `True` -- the first two elements are even the same. Let's see how they fare:
>>> hasDup_nub [1,1..]
^CInterrupted.
>>> hasDup_any [1,1..]
True
>>> hasDup_sort [1,1..]
^CInterrupted.
`nub [1,1..]` is `[1` (if you will allow me to write it that way -- that is, it's a one followed by an infinite loop if you try to evaluate any more elements), so when checking whether the lists are equal, we check whether the second element is `1`, and we get into an infinite loop. `sort [1,1..]` is just plain old bottom (another way of saying infinite loop), so we never get any information out of that. `hasDup_any` does okay because it checks the first element against the rest of the list. But it is easily foiled:
>>> hasDup_any (1:[2,2..])
^CInterrupted.
We get stuck trying to evaluate `any (1 ==) [2,2..]`, because we never find a `1`, but we don't know whether we just have to keep looking farther.

All three of these do perform differently with respect to laziness. Here's one where `hasDup_nub` returns when the others don't:
>>> hasDup_nub (1:2:2:[3,3..])
True
That's because `nub (1:2:2:[3,3..]) = [1,2,3` (in my infinite-loop notation), which is far enough to tell that it's not equal to `[1,2,2,3,3,3..]`, because they differ in the third element.
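We can see that prefix directly by demanding only the first three elements:

>>> take 3 (nub (1:2:2:[3,3..]))
[1,2,3]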
`hasDup_sort` is the strictest -- when the other two don't return an answer, neither will it.
However, @Satvik's answer points out that `hasDup_sort` has better complexity than the other two: its asymptotic complexity is O(n log n), whereas the other two are O(n^2).
It turns out that there is a pretty easy solution with `hasDup_sort`'s asymptotic complexity and better strictness properties than any of the other solutions. We just chomp along the list, recording what we've seen so far, and return as soon as we come across an element we've already seen. If we use a `Data.Set` to keep track of what we've seen, then insertion into and membership testing against that set are O(log n) time, and we reach an O(n log n) solution.
import qualified Data.Set as Set

hasDup_set :: (Ord a) => [a] -> Bool
hasDup_set = go Set.empty
  where
    go seen []     = False
    go seen (x:xs)
      | x `Set.member` seen = True
      | otherwise           = go (Set.insert x seen) xs
It has no trouble with any of the infinite test cases so far:
>>> hasDup_set [1,1..]
True
>>> hasDup_set (1:[2,2..])
True
>>> hasDup_set (1:2:2:[3,3..])
True
Of course, if you give it an infinite list with no duplicates, it still can't tell. No computable function can (while still respecting referential transparency):
>>> hasDup_set [1,2..]
^CInterrupted.
I believe `hasDup_set` is as good as you can do in terms of both asymptotic complexity and laziness* (but I'm definitely open to challenges). It's fortunate that it's possible to get both at the same time -- sometimes there is an essential trade-off you have to make.
* Without using `unamb`, which can often achieve even greater levels of laziness. I have not discussed `unamb` because it is very hard for me, and I haven't (nor has anyone, to my knowledge) come up with a good methodology for using it effectively. See here for an attempt, and an example of how complex and subtle it can get.
Upvotes: 4
Reputation: 8103
This solution will short-circuit when a duplicate is found. I've included the definitions of `any` and `(||)` from the `Prelude` for completeness.
import Prelude hiding (any, (||))
x,y :: [Char]
x = "hello"
y = "helo"
main :: IO ()
main = mapM_ print [hasDupe x, hasDupe y]
hasDupe :: Eq a => [a] -> Bool
hasDupe [] = False
hasDupe (x:xs) = any (==x) xs || hasDupe xs
any :: (a -> Bool) -> [a] -> Bool
any f [] = False
any f (x:xs) = f x || any f xs
(||) :: Bool -> Bool -> Bool
True || _ = True
False || x = x
Upvotes: 1
Reputation: 60463
The `nub` function from the `Data.List` module

nub :: (Eq a) => [a] -> [a]

removes duplicates that have already been seen in a list, and otherwise preserves the order.
ghci> nub [1,3,5,3,6,5]
[1,3,5,6]
You can use this function to write the function you are looking for as a simple one-liner, shown below. Because of laziness, it's also as efficient as you can get when only using an `Eq` constraint! (You can do better if you require `Ord`, though.)
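Concretely, the one-liner could look like this (the name `hasDup` is my choice; it is the same definition I call `hasDup_nub` in my other answer):

import Data.List (nub)

-- True exactly when some element of the list occurs more than once.
hasDup :: (Eq a) => [a] -> Bool
hasDup xs = nub xs /= xs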
Upvotes: 8