How to improve performence of this Haskell code?

Question

I'm facing the following problem :

From the initial set [1,2,3,4] compute all possible subsets i.e [[1],[2],[3],[4],[1,2],[1,3],[1,4],[2,3],[2,4],[3,4],[1,2,3],[1,2,4],[1,3,4],[2,3,4],[1,2,3,4]]

I've wrote the following Haskell program generate.hs which is correct.

generateSets :: Eq a => [a] -> [[a]] -> [[a]] -> [[a]]
generateSets []  _  _  = []
generateSets src [] _  = let isets = growthup [] src in generateSets src iset iset
generateSets src sets rsets = if null sets' then rsets else generateSets src sets' (rsets++sets')
  where sets' = concatMap (flip growthup src) sets

growthup :: (Eq a) => [a] -> [a] -> [[a]]
growthup ps ss = map (\suf -> ps++[suf]) ss'
  where ss' = nextoccurence ps ss

nextoccurence :: (Eq a) => [a] -> [a] -> [a]
nextoccurence [] ys = ys
nextoccurence xs ys = tail ys'
  where ys' = dropWhile (/= last xs) ys

While executing it in the GHC interpreter ghci ...

ghci> generate [1,2,3,4] [] []
ghci> [[1],[2],[3],[4],[1,2],[1,3],[1,4],[2,3],[2,4],[3,4],[1,2,3],[1,2,4],[1,3,4],[2,3,4],[1,2,3,4]]

every thing goes fine but the program take too long for just small sets of size 30 for example.

My question is : It is possible to improve my code in order to gain more from haskell laziness, or garbagge collector or something else ?

Is my code a good candidate for parallelism ?

Thanks for any reply !

hammar · Accepted Answer

Sets have a lot of subsets. In fact, a set of n elements has 2ⁿ subsets, so a set of 30 elements has over one billion subsets. Whichever method you use to generate them, even iterating over the results is going to take a long time. For larger sets you can pretty much forget about going through them all before the heat death of the universe.

So there's only so much you can do performance-wise, as even doubling the speed of your algorithm will only let you work with lists of one more element in the same time. For most applications, the real solution is to avoid having to enumerate all the subsets in the first place.

That said, there is a simple inductive way of thinking about subsets which makes defining a proper subset function easy without having to do any equality comparisons, which solves some of the problems with your implementation.

For the base case, the empty set has one subset: the empty set.

subsets [] = [[]]

For a set with at least one element (x:xs), we have the subsets which contain that element, and the ones that don't. We can get the subsets that don't contain x by recursively calling subsets xs, and we can get the rest by prepending x to those.

subsets (x:xs) = subsets xs ++ map (x:) (subsets xs)

The definition of subsequences in Data.List works on the same principle, but in a slightly more optimized way, which also returns the subsets in a different order and makes better use of sharing. However, as I said, enumerating the subsets of a list of length 30 is going to be slow no matter what, and your best bet is to try to avoid having to do it in the first place.

How to improve performence of this Haskell code?

Answers (1)

Related Questions