Reputation: 7815
Since a code example is worth a thousand words I'll start with that:
testList = [1,2,2,3,4,5]
testSet = map sumMapper $ tails testList
where sumMapper [] = []
sumMapper (a:b) = sumMap a b
sumMap a b = map (+ a) b
This code takes a list and adds up all the elements to get the sum of all of them (I'd also be interested in efficiency of this). The output of testSet is:
[[3,3,4,5,6],[4,5,6,7],[5,6,7],[7,8],[9],[],[]]
I would like to find the union of these lists (to make it into a set) but I feel that:
whatIWant = foldl1 union testSet
will have bad performance (the real lists will be thousands of elements long).
Is this the correct solution or am I missing something obvious?
Upvotes: 5
Views: 1667
Reputation: 71074
Since union
is an associative operation (a+(b+c)==(a+b)+c), you can use tree-shaped folding for a logarithmic advantage in time complexity:
_U [] = []
_U (xs:t) = union xs (_U (pairs t))
pairs (xs:ys:t) = union xs ys : pairs t
pairs t = t
Of course Data.List.union
itself is O(n2) in general, but if your testList
is ordered non-decreasing, all the lists will be too, and you can use a linear ordUnion
instead of the union
, for a solution which is linearithmic overall and shouldn't leak space:
ordUnion :: (Ord a) => [a] -> [a] -> [a]
ordUnion a [] = a
ordUnion [] b = b
ordUnion (x:xs) (y:ys) = case compare x y of
LT -> x : ordUnion xs (y:ys)
EQ -> x : ordUnion xs ys
GT -> y : ordUnion (x:xs) ys
To prevent duplicates which might slip through, one more function is needed to process _U
's output—a linear ordNub :: (Ord a) => [a] -> [a]
, with an obvious implementation.
Using the left-preferential (\(x:xs) ys -> x:ordUnion xs ys)
could be even more productive overall (force smaller portions of the input at each given moment):
g testList = ordNub . _U $ [map (+ a) b | (a:b) <- tails testList]
where
_U [] = []
_U ((x:xs):t) = x : ordUnion xs (_U (pairs t))
pairs ((x:xs):ys:t) = (x : ordUnion xs ys) : pairs t
pairs t = t
see also:
Upvotes: 1
Reputation: 15605
If you're using elements that are members of the Ord
typeclass, as in your example, you can use Data.Set
:
import qualified Data.Set as Set
whatYouWant = foldl' (Set.union . Set.fromList) Set.empty testSet
This has the advantage of taking space proportional to the size of the largest sublist rather than to the size of the entire concatenated list as does the Set.fromList . concat
solution. The strict foldl'
also prevents buildup of unevaluated thunks, preventing O(n)
stack and heap space usage.
Generally speaking, an Ord
constraint allows more efficient algorithms than an Eq
constraint because it allows you to build a tree. This is also the reason that nub
is O(n^2)
: the more efficient algorithm requires Ord
rather than just Eq
.
Upvotes: 5
Reputation: 12070
You might want to try
nub $ concat theListOfLists
In the version using union
, the code to cut out duplicates will get run many times. Here it only is run once.
It will only execute the code to pull out the unique values once.
There is also a Data.Set library, you could alternatively use
import Data.Set
S.fromList $ concat theListOfLists
The important point is that the code (here and above) that pulls out duplicates only gets run on the full list once, rather than over and over again.
edit- Rein mentions below that nub is O(n^2), so you should avoid the first solution above in favor of something O(n log n), as Data.Set.fromList should be. As others have mentioned in the comments, you need something that enforces Ord a
to get the proper complexity O(n log n)
, and Data.Set does, nub does not.
I will leave the two solutions (poor performance and good performance) because I think the resulting discussion was useful.
Upvotes: 5