Reputation: 22742
I've recently written some Scala code which processes a String, finding all its sub-strings and retaining a list of those which are found in a dictionary. The start and end of the sub-strings within the overall string also have to be retained for later use, so the easiest way to do this seemed to be just to use nested for loops, something like this:
for (i <- 0 until word.length)
for (j <- i until word.length) {
val sub = word.substring(i, j + 1)
// lookup sub in dictionary here and add new match if found
}
As an exercise, I decided to have a go at doing the same thing in Haskell. It seems straightforward enough without the need for the sub-string indices - I can use something like this approach to get the sub-strings, then call a recursive function to accumulate the matches. But if I want the indices too it seems trickier.
How would I write a function which returns a list containing each continuous sub-string along with its start and end index within the "parent" string?
For example tokens "blah"
would give [("b",0,0), ("bl",0,1), ("bla",0,2), ...]
A great selection of answers and plenty of new things to explore. After messing about a bit, I've gone for the first answer, with Daniel's suggestion to allow the use of [0..]
.
data Token = Token String Int Int
continuousSubSeqs = filter (not . null) . concatMap tails . inits
tokenize xs = map (\(s, l) -> Token s (head l) (last l)) $ zip s ind
where s = continuousSubSeqs xs
ind = continuousSubSeqs [0..]
This seemed relatively easy to understand, given my limited Haskell knowledge.
Upvotes: 3
Views: 2491
Reputation: 9225
My version:
import Data.List
tokens =
map join . filter (not . null) . concatMap inits . tails . zip [0..]
where
join s@((i, _):t) =
(map snd s, i, foldl' (\_ i -> i) i (map fst t))
main =
putStrLn $ show $ tokens "blah"
-- [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
UPDATE:
import Control.Arrow
...
tokens =
map join . filter (not . null) . concatMap inits . tails . zip [0..] where
join s = (s', i, j) where
((i, j), s') = (first (head &&& last)) $ unzip s
...
Upvotes: 1
Reputation: 21
Another version easier to read left to right, similar to unix pipes
import Data.List
import Control.Category
tokens =
tailsWithIndex
>>> concatMap (\(i,str) -> zip (repeat i) (initsWithIndex str))
>>> map adjust
where
tailsWithIndex = tails >>> init >>> zip [0..]
initsWithIndex = inits >>> tail >>> zip [0..]
adjust (i, (j, str)) = (str, i, i+j)
Sample run
>tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
If concatMap
is lazy, then the whole computation is lazy and will be efficient, except for the use of Data.List functions instead of raw list access.
Upvotes: 1
Reputation: 8050
The two nested loops you wrote are an excellent starting point. That is, we can write a function tokens
that delegates its work to two recursive functions outer
and inner
that correspond to your loops:
type Token a = ([a], Int, Int)
tokens :: [a] -> [Token a]
tokens = outer 0
where
outer _ [] = []
outer i l@(_ : xs) = inner i [] l ++ outer (i + 1) xs
where
inner _ _ [] = []
inner j acc (x : xs) =
(acc ++ [x], i, j) : inner (j + 1) (acc ++ [x]) xs
Here, outer
iterates over the string and, for each start position within that string, calls inner
to collect all the segments that start at that position together with their end positions.
Although this function meets your requirements,
> tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
it is quite inefficient due to the repeated list concatenation. A more efficient version would accumulate its results in so-called difference lists:
type Token a = ([a], Int, Int)
tokens :: [a] -> [Token a]
tokens l = outer 0 l []
where
outer _ [] = id
outer i l@(_ : xs) = inner i id l . outer (i + 1) xs
where
inner _ _ [] = id
inner j acc (x : xs) =
((acc [x], i, j) :) . inner (j + 1) (acc . (x :)) xs
How to construct the dictionary of course depends on how you choose to represent it. Here's an approach that uses simple ordered association lists,
type Dict a = [([a], [(Int, Int)])]
empty :: Dict a
empty = []
update :: Ord a => Token a -> Dict a -> Dict a
update (xs, i, j) [] = [(xs, [(i, j)])]
update (xs, i, j) ((ys, ns) : dict) = case compare xs ys of
LT -> (xs, [(i, j)]) : (ys, ns) : dict
EQ -> (ys, (i, j) : ns) : dict
GT -> (ys, ns) : update (xs, i, j) dict
toDict :: Ord a => [a] -> Dict a
toDict = foldr update empty . tokens
but as your keys are strings, tries (a.k.a. prefix trees) are probably a better choice.
If it's efficient substring queries that you're after, I would recommend looking into suffix trees, although their implementation is somewhat involved. You may want to check out
and Bryan O'Sullivan's suffixtree package on Hackage.
Upvotes: 1
Reputation: 4976
import Data.List
continuousSubSeqs = filter (not . null) . concatMap inits . tails
tokens xs = map (\(s, l) -> (s, head l, last l)) $ zip s ind
where s = continuousSubSeqs xs
ind = continuousSubSeqs [0..length(xs)-1]
Works like this:
tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]
Upvotes: 2