ajerneck
ajerneck

Reputation: 751

Converting a hierarchical data structure to a flat one in Haskell

I'm extracting some data from a text document organized like this:

- "day 1"
    - "Person 1"
        - "Bill 1"
    - "Person 2"
        - "Bill 2"

I can read this into a list of tuples that looks like this:

[(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]

Where the first item of each tuple indicates the heading level, and the second item the information associated with each heading.

My question is, how can I get a list of items that looks like this:

[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]

I.e. one list per deepest nested item, containing all the information from the headings above it. The closest I've gotten is this:

f [] = []
f (x:xs) = row:f rest where
leaves = takeWhile (\i -> fst i > fst x) xs
rest = dropWhile (\i -> fst i > fst x) xs
row = concat $ map (\i -> (snd x):[snd i]) leaves

Which gives me this:

[[["day 1"],["Intro 1"],["day 1"],["Bill 1"],["day 1"],["Intro 2"],["day 1"],["Bill 2"]]]

I'd like the solution to work for any number of levels.

P.s. I'm new to Haskell. I have a sense that I could/should use a tree to store the data, but I can't wrap my head around it. I also could not think of a better title.

Upvotes: 4

Views: 1350

Answers (2)

Gabriella Gonzalez
Gabriella Gonzalez

Reputation: 35089

Trees

You were right that you should probably use a tree to store the data. I'll copy how Data.Tree does it:

data Tree a = Node a (Forest a) deriving (Show)

type Forest a = [Tree a]

Building the Tree

Now we want to take your weakly typed list of tuples and convert it to a (slightly) stronger Tree of Strings. Any time you need to convert a weakly typed value and validate it before converting to a stronger type, you use a Parser:

type YourData = [(Int, [String])]

type Parser a = YourData -> Maybe (a, YourData)

The YourData type synonym represents the weak type that you are parsing. The a type variable is the value you are retrieving from the parse. Our Parser type returns a Maybe because the Parser might fail. To see why, the following input does not correspond to a valid Tree, since it is missing level 1 of the tree:

[(0, ["val1"]), (2, ["val2"])]

If the Parser does succeed, it also returns the unconsumed input so that subsequent parsing stages can use it.

Now, curiously enough, the above Parser type exactly matches a well known monad transformer stack:

StateT s Maybe a

You can see this if you expand out the underlying implementation of StateT:

StateT s Maybe a ~ s -> Maybe (a, s)

This means we can just define:

import Control.Monad.Trans.State.Strict

type Parser a = StateT [(Int, [String])] Maybe a

If we do this, we get a Monad, Applicative and Alternative instance for our Parser type for free. This makes it very easy to define parsers!

First, we must define a primitive parser that consumes a single node of the tree:

parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing

This is the only non-trivial piece of code we have to write, which, because it is total, handles all the following corner cases:

  • The list is empty
  • Your node has multiple values in it
  • The number in the tuple doesn't match the expected depth

The next part is where things get really elegant. We can then define two mutually recursive parsers, one for parsing a Tree, and the other for parsing a Forest:

import Control.Applicative

parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)

parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)

The first parser uses Applicative style, since StateT gave us an Applicative instance for free. However, I could also have used StateT's Monad instance instead, to give code that's more readable for an imperative programmer:

parseTree :: Int -> Parser (Tree String)
parseTree level = do
    str    <- parseElement level
    forest <- parseForest (level + 1)
    return $ Node str forest

But what about the many function? What's that doing? Let's look at its type:

many :: (Alternative f) => f a -> f [a]

It takes anything that returns a value and implements Applicative and instead calls it repeatedly to return a list of values instead. When we defined our Parser type in terms of State, we got an Alternative instance for free, so we can use the many function to convert something that parses a single Tree (i.e. parseTree), into something that parses a Forest (i.e. parseForest).

To use our Parser, we just rename an existing StateT function to make its purpose clear:

runParser :: Parser a -> [(Int, [String])] -> Maybe a runParser = evalStateT

Then we just run it!

>>> runParser (parseForest 0) [(0,["day 1"]),(1,["Person 1"]),(2,["Bill 1"]),(1,["Person 2"]),(2,["Bill 2"])]
Just [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]

That's just magic! Let's see what happens if we give it an invalid input:

>>> runParser (parseForest 0) [(0, ["val1"]), (2, ["val2"])]
Just [Node "val1" []]

It succeeds on a portion of the input! We can actually specify that it must consume the entire input by defining a parser that matches the end of the input:

eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing

Now let's try it:

>>> runParser (parseForest 0 >> eof) [(0, ["val1"]), (2, ["val2"])]
Nothing

Perfect!

Flattening the Tree

To answer your second question, we again solve the problem using mutually recursive functions:

flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest

flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _ -> map (a:) (flattenForest forest)

Let's try it!

>>> flattenForest [Node "day 1" [Node "Person 1" [Node "Bill 1" []],Node "Person 2" [Node "Bill 2" []]]]
[["day 1","Person 1","Bill 1"],["day 1","Person 2","Bill 2"]]

Now, technically I didn't have to use mutually recursive functions. I could have done a single recursive function. I was just following the definition of the Tree type from Data.Tree.

Conclusion

So in theory I could have shortened the code even further by skipping the intermediate Tree type and just parsing the flattened result directly, but I figured you might want to use the Tree-based representation for other purposes.

The key take home points from this are:

  • Learn Haskell abstractions to simplify your code
  • Always write total functions
  • Learn to use recursion effectively

If you do these, you will write robust and elegant code that exactly matches the problem.

Appendix

Here is the final code that incorporates everything I've said:

import Control.Applicative
import Control.Monad.Trans.State.Strict
import Data.Tree

type YourType = [(Int, [String])]

type Parser a = StateT [(Int, [String])] Maybe a

runParser :: Parser a -> [(Int, [String])] -> Maybe a
runParser = evalStateT

parseElement :: Int -> Parser String
parseElement level = StateT $ \list -> case list of
    []                  -> Nothing
    (level', strs):rest -> case strs of
        [str] ->
            if (level' == level)
            then Just (str, rest)
            else Nothing
        _     -> Nothing

parseTree :: Int -> Parser (Tree String)
parseTree level = Node <$> parseElement level <*> parseForest (level + 1)

parseForest :: Int -> Parser (Forest String)
parseForest level = many (parseTree level)

eof :: Parser ()
eof = StateT $ \list -> case list of
    [] -> Just ((), [])
    _  -> Nothing

flattenForest :: Forest a -> [[a]]
flattenForest forest = concatMap flattenTree forest

flattenTree :: Tree a -> [[a]]
flattenTree (Node a forest) = case forest of
    [] -> [[a]]
    _  -> map (a:) (flattenForest forest)

Upvotes: 5

Karolis Juodelė
Karolis Juodelė

Reputation: 3770

I seem to have solved it.

group :: [(Integer, [String])] -> [[String]]
group ((n, str):ls) = let
      (children, rest) = span (\(m, _) -> m > n) ls
      subgroups = map (str ++) $ group children
   in if null children then [str] ++ group rest
      else subgroups ++ group rest
group [] = []

I didn't test it much though.

The idea is to notice the recursive pattern. This function takes the first element (N, S) of the list and then gathers all entries in higher levels until another element at level N, into a list 'children'. If there are no children, we are at the top level and S forms the output. If there are some, S is appended to all of them.

As for why your algorithm doesn't work, the problem is mostly in row. Notice that you are not descending recursively.


Trees can be used too.

data Tree a = Node a [Tree a] deriving Show

listToTree :: [(Integer, [String])] -> [Tree [String]]
listToTree ((n, str):ls) = let
      (children, rest) = span (\(m, _) -> m > n) ls
      subtrees = listToTree children
   in Node str subtrees : listToTree rest
listToTree [] = []

treeToList :: [Tree [String]] -> [[String]]
treeToList (Node s ns:ts) = children ++ treeToList ts where
   children = if null ns then [s] else map (s++) (treeToList ns)
treeToList [] = []

The algorithm is essentially the same. The first half goes to the first function, the second half to the second.

Upvotes: 3

Related Questions