user3042405
user3042405

Reputation: 33

How to split a [String] in to [[String]] based on length

I'm trying to split a list of Strings in to a List of Lists of Strings so like in the title [String] -> [[String]]

This has to be done based on length of characters, so that the Lists in the output are no longer than 10. So if input was length 20 this would be broken down in to 2 lists and if length 21 in to 3 lists.

I'm not sure what to use to do this, I don't even know how to brake down a list in to a list of lists never mind based on certain length.

For example if the limit was 5 and the input was:

["abc","cd","abcd","ab"]

The output would be:

[["abc","cd"],["abcd"],["ab"]]

I'd like to be pointed in the right direction and what methods to use, list comprehension? recursion?

Upvotes: 0

Views: 537

Answers (4)

Marimuthu Madasamy
Marimuthu Madasamy

Reputation: 13551

Here is another approach. It is clear from the problem that the result is a list of lists and we need a running length and an inner list to keep track of how much we have accumulated (We use foldl' with these two as input). We then describe what we want which is basically:

  1. If the length of the current input string itself exceeds the input length, we ignore that string (you may change this if you want a different behavior).
  2. If the new length after we have added the length of the current string is within our input length, we add it to the current result list.
  3. If the new length exceeds the input length, we add the result so far to the output and start a new result list.
chunks len = reverse  . map reverse . snd . foldl' f (0, [[]]) where
  f (resSoFar@(lenSoFar, (currRes: acc)) curr
    | currLength > len = resSoFar -- ignore
    | newLen <= len    = (newLen, (curr: currRes):acc)
    | otherwise        = (currLength, [curr]:currRes:acc) 
    where
      newLen = lenSoFar + currLength
      currLength = length curr

Every time we add a result to the output list, we add it to the front hence we need reverse . map reverse at the end.

> chunks 5 ["abc","cd","abcd","ab"]
[["abc","cd"],["abcd"],["ab"]]

> chunks 5 ["abc","cd","abcdef","ab"]
[["abc","cd"],["ab"]]

Upvotes: 0

amar47shah
amar47shah

Reputation: 717

Here's an intuitive solution:

import Data.List (foldl')

breakup :: Int -> [[a]] -> [[[a]]]
breakup size = foldl' accumulate [[]]
  where accumulate broken l
         | length l > size = error "Breakup size too small."
         | sum (map length (last broken ++ [l])) <= size
               = init broken ++ [last broken ++ [l]]
         | otherwise = broken ++ [[l]]

Now, let's go through it line-by-line:

breakup :: Int -> [[a]] -> [[[a]]]

Since you hinted that you may want to generalize the function to accept different size limits, our type signature reflects this. We also generalize beyond [String] (that is, [[Char]]), since our problem is not specific to [[Char]], and could equally apply to any [[a]].

breakup size = foldl' accumulate [[]]

We're using a left fold because we want to transform a list, left-to-right, into our target, which will be a list of sub-lists. Even though we're not concerned with efficiency, we're using Data.List.foldl' instead of Prelude's own foldl because this is standard practice. You can read more about foldl vs. foldl' here.

Our folding function is called accumulate. It will consider a new item and decide whether to place it in the last-created sub-list or to start a new sub-list. To make that judgment, it uses the size we passed in. We start with an initial value of [[]], that is, a list with one empty sub-list.

Now the question is, how should you accumulate your target?

  where accumulate broken l

We're using broken to refer to our constructed target so far, and l (for "list") to refer to the next item to process. We'll use guards for the different cases:

         | length l > size = error "Breakup size too small."

We need to raise an error if the item surpasses the size limit on its own, since there's no way to place it in a sub-list that satisfies the size limit. (Alternatively, we could build a safe function by wrapping our return value in the Maybe monad, and that's something you should definitely try out on your own.)

         | sum (map length (last broken ++ [l])) <= size
               = init broken ++ [last broken ++ [l]]

The guard condition is sum (map length (last broken ++ [l])) <= size, and the return value for this guard is init broken ++ [last broken ++ [l]]. Translated into plain English, we might say, "If the item can fit in the last sub-list without going over the size limit, append it there."

         | otherwise = broken ++ [[l]]

On the other hand, if there isn't enough "room" in the last sub-list for this item, we start a new sub-list, containing only this item. When the accumulate helper is applied to the next item in the input list, it will decide whether to place that item in this sub-list or start yet another sub-list, following the same logic.

There you have it. Don't forget to import Data.List (foldl') up at the top. As another answer points out, this is not a performant solution if you plan to process 100,000 strings. However, I believe this solution is easier to read and understand. In many cases, readability is the more important optimization.

Thanks for the fun question. Good luck with Haskell, and happy coding!

Upvotes: 5

Helder Pereira
Helder Pereira

Reputation: 5756

You can do something like this:

splitByLen :: Int -> [String] -> [[String]]
splitByLen n s = go (zip s $ scanl1 (+) $ map length s) 0
  where go [] _ = []
        go xs prev = let (lst, rest) = span (\ (x, c) -> c - prev <= n) xs
                     in (map fst lst) : go rest (snd $ last lst)

And then:

*Main> splitByLen 5 ["abc","cd","abcd","ab"]
[["abc","cd"],["abcd"],["ab"]]

In case there is a string longer than n, this function will fail. Now, what you want to do in those cases depends on your requirements and that was not specified in your question.


[Update]

As requested by @amar47shah, I made a benchmark comparing his solution (breakup) with mine (splitByLen):

import Data.List
import Data.Time.Clock
import Control.DeepSeq
import System.Random

main :: IO ()
main = do
  s <- mapM (\ _ -> randomString 10) [1..10000]
  test "breakup    10000" $ breakup    10 s
  test "splitByLen 10000" $ splitByLen 10 s
  putStrLn ""
  r <- mapM (\ _ -> randomString 10) [1..100000]
  test "breakup    100000" $ breakup    10 r
  test "splitByLen 100000" $ splitByLen 10 r

test :: (NFData a) => String -> a -> IO ()
test s a = do time1 <- getCurrentTime
              time2 <- a `deepseq` getCurrentTime
              putStrLn $ s ++ ": " ++ show (diffUTCTime time2 time1)

randomString :: Int -> IO String
randomString n = do
  l <- randomRIO (1,n)
  mapM (\ _ -> randomRIO ('a', 'z')) [1..l]

Here are the results:

breakup    10000: 0.904012s
splitByLen 10000: 0.005966s

breakup    100000: 150.945322s
splitByLen 100000: 0.058658s

Upvotes: 3

ErikR
ErikR

Reputation: 52057

Here is an elementary approach. First, the type String doesn't matter, so we can define our function in terms of a general type a:

breakup :: [a] -> [[a]]

I'll illustrate with a limit of 3 instead of 10. It'll be obvious how to implement it with another limit.

The first pattern will handle lists which are of size >= 3 and the the second pattern handles all of the other cases:

breakup (a1 : a2 : a3 : as) = [a1, a2, a3] : breakup as
breakup as = [ as ]

It is important to have the patterns in this order. That way the second pattern will only be used when the first pattern does not match, i.e. when there are less than 3 elements in the list.

Examples of running this on some inputs:

breakup [1..5]       -> [ [1,2,3], [4,5] ]
breakup [1..4]       -> [ [1,2,3], [4] ]
breakup [1..2]       -> [ [1,2] ]
breakup [1..3]       -> [ [1,2,3], [] ]

We see these is an extra [] when we run the function on [1..3]. Fortunately this is easy to fix by inserting another rule before the last one:

breakup [] = []

The complete definition is:

breakup :: [a] -> [[a]]
breakup [] = []
breakup (a1 : a2 : a3 : as) = [a1, a2, a3] : breakup as
breakup as = [ as ]

Upvotes: -1

Related Questions