Reputation: 22762

How to find all substrings of a String with start and end indices

I've recently written some Scala code which processes a String, finding all its sub-strings and retaining a list of those which are found in a dictionary. The start and end of the sub-strings within the overall string also have to be retained for later use, so the easiest way to do this seemed to be just to use nested for loops, something like this:

for (i <- 0 until word.length)
  for (j <- i until word.length) {
    val sub = word.substring(i, j + 1)
    // lookup sub in dictionary here and add new match if found
  }

As an exercise, I decided to have a go at doing the same thing in Haskell. It seems straightforward enough without the need for the sub-string indices - I can use something like this approach to get the sub-strings, then call a recursive function to accumulate the matches. But if I want the indices too it seems trickier.

How would I write a function which returns a list containing each continuous sub-string along with its start and end index within the "parent" string?

For example tokens "blah" would give [("b",0,0), ("bl",0,1), ("bla",0,2), ...]

Update

A great selection of answers and plenty of new things to explore. After messing about a bit, I've gone for the first answer, with Daniel's suggestion to allow the use of [0..].

data Token = Token String Int Int 

continuousSubSeqs = filter (not . null) . concatMap tails . inits

tokenize xs = map (\(s, l) -> Token s (head l) (last l)) $ zip s ind
    where s = continuousSubSeqs xs
          ind = continuousSubSeqs [0..]

This seemed relatively easy to understand, given my limited Haskell knowledge.

Upvotes: 3

Answers (4)

Victor Moroz

Reputation: 9225

My version:

import Data.List

tokens = 
  map join . filter (not . null) . concatMap inits . tails . zip [0..]
  where 
    join s@((i, _):t) = 
      (map snd s, i, foldl' (\_ i -> i) i (map fst t))

main =
  putStrLn $ show $ tokens "blah"

-- [("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

UPDATE:

import Control.Arrow

...

tokens = 
  map join . filter (not . null) . concatMap inits . tails . zip [0..] where 
    join s = (s', i, j) where 
      ((i, j), s') = (first (head &&& last)) $ unzip s

...

Upvotes: 1

kusimari

Reputation: 21

Another version easier to read left to right, similar to unix pipes

import Data.List
import Control.Category 

tokens = 
     tailsWithIndex
     >>> concatMap (\(i,str) -> zip (repeat i) (initsWithIndex str))
     >>> map adjust
     where
       tailsWithIndex = tails >>> init >>> zip [0..]
       initsWithIndex = inits >>> tail >>> zip [0..]
       adjust (i, (j, str)) = (str, i, i+j)

Sample run

>tokens "blah" 
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

If concatMap is lazy, then the whole computation is lazy and will be efficient, except for the use of Data.List functions instead of raw list access.

Upvotes: 1

Stefan Holdermans

Reputation: 8050

The two nested loops you wrote are an excellent starting point. That is, we can write a function tokens that delegates its work to two recursive functions outer and inner that correspond to your loops:

type Token a = ([a], Int, Int)

tokens :: [a] -> [Token a]
tokens = outer 0
  where
    outer _ []         = []
    outer i l@(_ : xs) = inner i [] l ++ outer (i + 1) xs
      where
        inner _ _ []         = []
        inner j acc (x : xs) =
          (acc ++ [x], i, j) : inner (j + 1) (acc ++ [x]) xs

Here, outer iterates over the string and, for each start position within that string, calls inner to collect all the segments that start at that position together with their end positions.

Although this function meets your requirements,

> tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

it is quite inefficient due to the repeated list concatenation. A more efficient version would accumulate its results in so-called difference lists:

type Token a = ([a], Int, Int)

tokens :: [a] -> [Token a]
tokens l = outer 0 l []
  where
    outer _ []         = id
    outer i l@(_ : xs) = inner i id l . outer (i + 1) xs
      where
        inner _ _ []         = id
        inner j acc (x : xs) =
          ((acc [x], i, j) :) . inner (j + 1) (acc . (x :)) xs

How to construct the dictionary of course depends on how you choose to represent it. Here's an approach that uses simple ordered association lists,

type Dict a = [([a], [(Int, Int)])]

empty :: Dict a
empty = []

update :: Ord a => Token a -> Dict a -> Dict a
update (xs, i, j) []                = [(xs, [(i, j)])]
update (xs, i, j) ((ys, ns) : dict) = case compare xs ys of
  LT -> (xs, [(i, j)]) : (ys, ns) : dict
  EQ -> (ys, (i, j) : ns) : dict
  GT -> (ys, ns) : update (xs, i, j) dict

toDict :: Ord a => [a] -> Dict a
toDict = foldr update empty . tokens

but as your keys are strings, tries (a.k.a. prefix trees) are probably a better choice.

If it's efficient substring queries that you're after, I would recommend looking into suffix trees, although their implementation is somewhat involved. You may want to check out

Robert Giegerich and Stefan Kurtz. A comparison of imperative and purely functional suffix tree constructions. Science of Computer Programming 25(2–3):187–218, 1995

and Bryan O'Sullivan's suffixtree package on Hackage.

Upvotes: 1

applicative_functor

Reputation: 4986

import Data.List

continuousSubSeqs = filter (not . null) . concatMap inits . tails

tokens xs = map (\(s, l) -> (s, head l, last l)) $ zip s ind
    where s   = continuousSubSeqs xs
          ind = continuousSubSeqs [0..length(xs)-1]

Works like this:

tokens "blah"
[("b",0,0),("bl",0,1),("bla",0,2),("blah",0,3),("l",1,1),("la",1,2),("lah",1,3),("a",2,2),("ah",2,3),("h",3,3)]

Upvotes: 2

How to find all substrings of a String with start and end indices

Update

Answers (4)

Related Questions