Reputation: 13

Splitting string into type in Haskell

I need to create a parse function. I am new in Haskell and I am interesting can my thinking be implemented in Haskell using only GHC base functions.

So the problem is : I have so message in string with coordinates and value like (x: 01, 01, ... y:01, 02,: v: X, Y, Z) and i need to parse it type like ([Char], [Int], [Int]).

In language like C , I would create loop and go from start and would check and then put it in there arrays but I am afraid this would not work in Haskell. Can someone give a hint on a approachable solutions to this problem?

Upvotes: 1

Answers (1)

Jon Purdy

Reputation: 54971

If you’re accustomed to imperative programming with loops, you can actually do a fairly literal translation of an imperative solution to Haskell using direct recursion.

Bear in mind, this isn’t the easiest or best way to arrive at a working solution, but it’s good to learn the technique so that you understand what more idiomatic solutions are abstracting away for you.

The basic principle is to replace each loop with a recursive function, and replace each mutable variable with an accumulator parameter to that function. Where you would modify the variable within an iteration of the loop, just make a new variable; where you would modify it between iterations of the loop, call the looping function with a different argument in place of that parameter.

For a simple example, consider computing the sum of a list of integers. In C, that might be written like this:

struct ListInt { int head; struct ListInt *tail; }

int total(ListInt const *list) {
    int acc = 0;
    ListInt const *xs = list;
    while (xs != NULL) {
        acc += xs->head;
        xs = xs->tail;
    }
    return acc;
}

We can translate that literally to low-level Haskell:

total :: [Int] -> Int
total list
  = loop
    0     -- acc = 0
    list  -- xs = list

  where

    loop
      :: Int    -- int acc;
      -> [Int]  -- ListInt const *xs;
      -> Int

    loop acc xs                 -- loop:

      | not (null xs) = let     -- if (xs != NULL) {
        acc' = acc + head xs    --   acc += xs->head;
        xs' = tail xs           --   xs = xs->tail;
        in loop acc' xs'        --   goto loop;
                                -- } else {
      | otherwise = acc         --   return acc;
                                -- }

The outer function total sets up the initial state, and the inner function loop handles the iteration over the input. In this case, total immediately returns after the loop, but if there were some more code after the loop to process the results, that would go in total:

total list = let
  result = loop 0 list
  in someAdditionalProcessing result

It’s extremely common in Haskell for a helper function to accumulate a list of results by prepending them to the beginning of an accumulator list with :, and then reversing this list after the loop, because appending a value to the end of a list is much more costly. You can think of this pattern as using a list as a stack, where : is the “push” operation.

Also, straight away, we can make some simple improvements. First, the accessor functions head and tail may throw an error if our code is wrong and we call them on empty lists, just like accessing a head or tail member of a NULL pointer (although an exception is clearer than a segfault!), so we can simplify it and make it safer use pattern matching instead of guards & head/tail:

loop :: Int -> [Int] -> Int
loop acc [] = acc
loop acc (h : t) = loop (acc + h) t

Finally, this pattern of recursion happens to be a fold: there’s an initial value of the accumulator, updated for each element of the input, with no complex recursion. So the whole thing can be expressed with foldl':

total :: [Int] -> Int
total list = foldl' (\ acc h -> acc + h) 0 list

And then abbreviated:

total = foldl' (+) 0

So, for parsing your format, you can follow a similar approach: instead of a list of integers, you have a list of characters, and instead of a single integer result, you have a compound data type, but the overall structure is very similar:

parse :: String -> ([Char], [Int], [Int])
parse input = let
  (…, …, …) = loop ([], [], []) input
  in …

  where
    loop (…, …, …) (c : rest) = …  -- What to do for each character.
    loop (…, …, …) []         = …  -- What to do at end of input.

If there are different sub-parsers, where you would use a state machine in an imperative language, you can make the accumulator include a data type for the different states. For example, here’s a parser for numbers separated by spaces:

import Data.Char (isSpace, isDigit)

data ParseState
  = Space
  | Number [Char]  -- Digit accumulator

numbers :: String -> [Int]
numbers input = loop (Space, []) input
  where

    loop :: (ParseState, [Int]) -> [Char] -> [Int]

    loop (Space, acc) (c : rest)
      | isSpace c = loop (Space, acc) rest       -- Ignore space.
      | isDigit c = loop (Number [c], acc) rest  -- Push digit.
      | otherwise = error "expected space or digit"

    loop (Number ds, acc) (c : rest)
      | isDigit c = loop (Number (c : ds), acc) rest  -- Push digit.
      | otherwise
        = loop
          (Space, read (reverse ds) : acc)  -- Save number, expect space.
          (c : rest)                        -- Repeat loop for same char.

    loop (Number ds, acc) [] = let
      acc' = read (reverse ds) : acc  -- Save final number.
      in reverse acc'                 -- Return final result.

    loop (Space, acc) [] = reverse acc  -- Return final result.

Of course, as you may be able to tell, this approach quickly becomes very complicated! Even if you write your code very compactly, or express it as a fold, if you’re working at the level of individual characters and parser state machines, it will take a lot of code to express your meaning, and there are many opportunities for error. A better approach is to consider the data flow at work here, and put together the parser from high-level components.

For example, the intent of the above parser is to do the following:

Split the input on whitespace
For each split, read it as an integer

And that can be expressed very directly with the words and map functions:

numbers :: String -> [Int]
numbers input = map read (words input)

One readable line instead of dozens! Clearly this approach is better. Consider how you can express the format you’re trying to parse in this style. If you want to avoid libraries like split, you can still write a function to split a string on separators using base functions like break, span, or takeWhile; then you can use that to split the input into records, and split each record into fields, and parse fields as integers or textual names accordingly.

But the preferred approach for parsing in Haskell is not to manually split up input at all, but to use parser combinator libraries like megaparsec. There are parser combinators in base too, under Text.ParserCombinators.ReadP. With those, you can express a parser in the abstract, without talking about splitting up input at all, by just combining subparsers with standard interfaces (Functor, Applicative, Alternative, and Monad), for example:

import Data.Char (isDigit)

import Text.ParserCombinators.ReadP
  ( endBy
  , eof
  , munch1
  , readP_to_S
  , skipSpaces
  , skipSpaces
  )

numbers :: String -> [Int]
numbers = fst . head . readP_to_S onlyNumbersP
  where
    onlyNumbersP :: ReadP [Int]
    onlyNumbersP = skipSpaces *> numbersP <* eof

    numbersP :: ReadP [Int]
    numbersP = numberP `endBy` skipSpaces

    numberP :: ReadP Int
    numberP = read <$> munch1 isDigit

This is the approach I would recommend in your case. Parser combinators are also an excellent way to get comfortable using applicatives and monads in practice.

Upvotes: 3

Splitting string into type in Haskell

Answers (1)

Related Questions