optimize/improve Haskell code to list sub-sequence frequencies

Question

I wrote the code below to list sub-sequence frequencies of a list of lists (results include the sub-sequence and the indexes of the lists where the sub-sequence occurs). Does anyone have any suggestions how to make it more concise and/or efficient?

Sample output:

*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[0,1,2,4]),([2,3],[0,1,3]),([3,5,7],[0,2,4]),([5,7],[0,2,4]),([2,3,5],[0,1]),([1,2],[0,3]),([1,2,3],[0,3]),([7,9],[2,3])]

import Data.List
import Data.Function (on)

--[[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]

tupleCat x y = (fst x, sort $ nub $ snd x ++ snd y)
isInResult x result = case lookup x result of
                        Just a  -> [a]
                        Nothing -> []

sInt xs = concat $ sInt' (csubs xs) 0 (length xs) where
    csubs = map (filter (not . null) . concatMap inits . tails)
    sInt' []     _     _       = []
    sInt' (x:xs) count origLen = 
        let result = (zip (zip (replicate (length xs) count) [count+1..origLen]) 
                 $ map (\y -> intersect x y) xs)
        in concatMap (\x -> let a = fst x in map (\y -> (y,a)) (snd x))
                 result : sInt' xs (count + 1) origLen

concatResults [] result     = result 
concatResults (x:xs) result = 
    let match = isInResult (fst x) result 
        newX  = (fst x, [fst $ snd x, snd $ snd x])
    in  if not (null match)
        then let match'    = (fst x, head match)
                 newResult = deleteBy (\x -> (==match')) match' result
             in concatResults xs (tupleCat match' newX : newResult)
        else concatResults xs (newX : result)

combFreq xs =
  filter (\x -> length (fst x) > 1)
  $ reverse $ sortBy (compare `on` (length . snd)) $ concatResults (sInt xs) []

sabauma · Accepted Answer

Here is how I would go about doing it. I haven't compared it for performance, and it is certainly naive. It enumerates all the contiguous subsequences for each list and gathers them into a Map. It should meet your requirement of more concise though.

import Data.List as L
import Data.Map (Map)
import qualified Data.Map as M

nonEmptySubs :: [a] -> [[a]]
nonEmptySubs = filter (not . null)
             . concatMap tails
             . inits

makePairs :: (a -> [a]) -> [a] -> [(a, Int)]
makePairs f xs = concat $ zipWith app xs [0 .. ]
    where app y i = zip (f y) (repeat i)

results :: (Ord a) => [[a]] -> Map [a] [Int]
results =
    let ins acc (seq, ind) = M.insertWith (++) seq [ind] acc
        -- Insert the index at the given sequence as a singleton list
    in foldl' ins M.empty . makePairs nonEmptySubs

combFreq :: (Ord a) => [[a]] -> [([a], [Int])]
combFreq = filter (not . null . drop 1 . snd) -- Keep subseqs with more than 1 match
         . filter (not . null . drop 1 . fst) -- keep subseqs longer than 1
         . M.toList
         . results

Just note that this version will give the same qualitative results, but it will not have the same ordering.

My biggest recommendation is to break things down more and leverage what you can from some of the standard libraries to do the tedious work. Notice that we can break a lot of the work down into separate stages and then compose those stages to get the final function.

optimize/improve Haskell code to list sub-sequence frequencies

Answers (2)

Related Questions