Reputation: 23945
I wrote the code below to list sub-sequence frequencies of a list of lists (results include the sub-sequence and the indexes of the lists where the sub-sequence occurs). Does anyone have any suggestions how to make it more concise and/or efficient?
Sample output:
*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[0,1,2,4]),([2,3],[0,1,3]),([3,5,7],[0,2,4]),([5,7],[0,2,4]),([2,3,5],[0,1]),([1,2],[0,3]),([1,2,3],[0,3]),([7,9],[2,3])]
import Data.List
import Data.Function (on)
--[[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
tupleCat x y = (fst x, sort $ nub $ snd x ++ snd y)
isInResult x result = case lookup x result of
Just a -> [a]
Nothing -> []
sInt xs = concat $ sInt' (csubs xs) 0 (length xs) where
csubs = map (filter (not . null) . concatMap inits . tails)
sInt' [] _ _ = []
sInt' (x:xs) count origLen =
let result = (zip (zip (replicate (length xs) count) [count+1..origLen])
$ map (\y -> intersect x y) xs)
in concatMap (\x -> let a = fst x in map (\y -> (y,a)) (snd x))
result : sInt' xs (count + 1) origLen
concatResults [] result = result
concatResults (x:xs) result =
let match = isInResult (fst x) result
newX = (fst x, [fst $ snd x, snd $ snd x])
in if not (null match)
then let match' = (fst x, head match)
newResult = deleteBy (\x -> (==match')) match' result
in concatResults xs (tupleCat match' newX : newResult)
else concatResults xs (newX : result)
combFreq xs =
filter (\x -> length (fst x) > 1)
$ reverse $ sortBy (compare `on` (length . snd)) $ concatResults (sInt xs) []
Upvotes: 1
Views: 358
Reputation: 1041
If all your lists are increasing (like they are in your example) the following should work (not a beauty as I'm a Haskell-newbie; comments about how to improve are very welcome):
import Control.Arrow (first, second)
compFreq ls = cF [] [] ls
where cF rs cs ls | all null ls = rs
| otherwise = cF (rs++rs') (cs'' ++ c ++ cs') ls'
where m = minimum $ map head $ filter (not . null) ls
ls' = map (\l -> if null l || m < head l then l
else tail l) ls
is = map snd $ filter ((==m) . head . fst) $ filter (not . null . fst) $ zip ls [0,1..]
c = if atLeastTwo is then [([m], is)] else []
fs = filter (\(vs, is') -> atLeastTwo $ combine is is') cs
cs' = map (\(vs, is') -> (vs++[m], combine is is')) fs
cs'' = map (second (filter (not . (`elem` is)))) cs
rs' = filter ok cs'
combine _ [] = []
combine [] _ = []
combine (i:is) (i':is') | i<i' = combine is (i':is')
| i>i' = combine (i:is) is'
| i==i' = i:combine is is'
atLeastTwo = not . null . drop 1
ok (js, ts) = atLeastTwo js && atLeastTwo ts
The idea is to process the lists by looking always at the minimal value m, which is removed from all lists to get ls'. The list of indices is tells where m was removed. The inner working function cF has two extra parameters: the list rs of results up to now and the list cs of current subsequences. The minimal value starts a new subsequence c if it occurs at least twice. cs' are subsequences which end with m and cs'' are those without m. The new results rs' all contain m as last element.
The output for your example is
[([1,2],[0,3]),([2,3],[0,1,3]),([1,2,3],[0,3]),([3,5],[0,1,2,4]),([2,3,5],[0,1]),([5,7],[0,2,4]),([3,5,7],[0,2,4]),([7,9],[2,3])]
Upvotes: 0
Reputation: 4253
Here is how I would go about doing it. I haven't compared it for performance,
and it is certainly naive. It enumerates all the contiguous subsequences for
each list and gathers them into a Map
. It should meet your requirement of
more concise though.
import Data.List as L
import Data.Map (Map)
import qualified Data.Map as M
nonEmptySubs :: [a] -> [[a]]
nonEmptySubs = filter (not . null)
. concatMap tails
. inits
makePairs :: (a -> [a]) -> [a] -> [(a, Int)]
makePairs f xs = concat $ zipWith app xs [0 .. ]
where app y i = zip (f y) (repeat i)
results :: (Ord a) => [[a]] -> Map [a] [Int]
results =
let ins acc (seq, ind) = M.insertWith (++) seq [ind] acc
-- Insert the index at the given sequence as a singleton list
in foldl' ins M.empty . makePairs nonEmptySubs
combFreq :: (Ord a) => [[a]] -> [([a], [Int])]
combFreq = filter (not . null . drop 1 . snd) -- Keep subseqs with more than 1 match
. filter (not . null . drop 1 . fst) -- keep subseqs longer than 1
. M.toList
. results
Just note that this version will give the same qualitative results, but it will not have the same ordering.
My biggest recommendation is to break things down more and leverage what you can from some of the standard libraries to do the tedious work. Notice that we can break a lot of the work down into separate stages and then compose those stages to get the final function.
Upvotes: 2