epsilonhalbe
epsilonhalbe

Reputation: 15967

stream parsing xml without knowing the ordering of child tags

I have to parse some xml, i decided to use xml-conduit for that task and use the streaming part of it.

The structure of the xml is given by an xsd-file that contains elements and how often they may occur. But not in which order they are expected.

How do I parse all possible reorderings of children of an xml-structure using Text.XML.Stream.Parse ?

The problem

Say we have an xml description like

    Root 
    /  \
   A    B

then both <Root><A>atext</A><B>btext</B></Root> and <Root><B>btext</B><A>atext</A></Root> are valid instances of this xml-structure. But parsing in the streaming setup needs an ordering to succeed.

I thought of using something like parseRoot1 <|> parseRoot2 but then I have to implement the Alternative instance and write all the possibilities by hand, which I really don't want to.

Here is a minimal sample haskell program.

Example.hs

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE Rank2Types #-}

module Main where

import           Control.Exception
import           Control.Monad.Trans.Resource (MonadThrow)
import           Text.XML.Stream.Parse
import           Data.Monoid ((<>))
import           Data.Maybe
import           Data.Text (Text)
import           Data.XML.Types (Event)
import           Data.Conduit (ConduitM, Consumer, yield, ($=), ($$))

data Root = Root {a :: A, b :: B} deriving (Show, Eq)
data A = A Text deriving (Show, Eq)
data B = B Text deriving (Show, Eq)

ex1, ex2 :: Text
ex1 = "<Root>"<>
        "<A>Atest</A>"<>
        "<B>Btest</B>"<>
      "</Root>"
ex2 = "<Root>"<>
        "<B>Btest</B>"<>
        "<A>Atest</A>"<>
      "</Root>"

ex :: Root
ex = Root {a = A "Atest", b = B "Btest"}

parseA :: MonadThrow m => ConduitM Event o m (Maybe A)
parseA = tagIgnoreAttrs "A"
            $ do result <- content
                 return (A $ result)

parseB :: MonadThrow m => ConduitM Event o m (Maybe B)
parseB = tagIgnoreAttrs "B"
            $ do result <- content
                 return (B result)

parseRoot1 :: MonadThrow m => ConduitM Event o m (Maybe Root)
parseRoot1 = tagIgnoreAttrs "Root" $ do
                 a' <- fromMaybe (error "error parsing A") <$> parseA
                 b' <- fromMaybe (error "error parsing B") <$> parseB
                 return $ Root{a = a', b = b'}

parseRoot2 :: MonadThrow m => ConduitM Event o m (Maybe Root)
parseRoot2 = tagIgnoreAttrs "Root" $ do
                 b' <- fromMaybe (error "error parsing B") <$> parseB
                 a' <- fromMaybe (error "error parsing A") <$> parseA
                 return $ Root{a = a', b = b'}

parseTxt :: Consumer Event (Either SomeException) (Maybe a)
                          -> Text 
                          -> Either SomeException (Maybe a)
parseTxt p inTxt = yield inTxt
                  $= parseText' def
                  $$ p

main :: IO ()
main = do putStrLn "Poor Mans Test Suite"
          putStrLn "===================="
          putStrLn "test1 Root -> A - B " -- works
          print $ parseTxt parseRoot1 ex1
          putStrLn "test1 Root -> B - A " -- fails
          print $ parseTxt parseRoot1 ex2
          putStrLn "test2 Root -> A - B " -- fails
          print $ parseTxt parseRoot2 ex1
          putStrLn "test2 Root -> B - A " -- works again
          print $ parseTxt parseRoot2 ex2

note

example.cabal

[...]
  build-depends: base >=4.8 && <4.9
               , conduit
               , resourcet
               , text
               , xml-conduit
               , xml-types
[...]

Upvotes: 3

Views: 215

Answers (1)

ErikR
ErikR

Reputation: 52029

Here's my idea...

First some definitions:

 {-# LANGUAGE OverloadedStrings, MultiWayIf #-}

 import Control.Monad.Trans.Resource
 import Data.Conduit
 import Data.Text (Text, unpack)
 import Data.XML.Types
 import Text.XML.Stream.Parse

 data SumType = A Text | B Text | C Text

We start with a conduit which accepts either an A or B tag, ignores the attributes and returns the name and content:

 parseAorB :: MonadThrow m => ConduitM Event o m (Maybe (Name, Text))
 parseAorB =
   tag (\n -> if (n == "A" || n == "B") then Just n else Nothing) -- accept either A or B
       (\n -> return n)                                           -- ignore attributes
       (\n -> do c <- content; return (n,c))                      -- extract content

Then we use that to write a conduit which parses two tags, makes sure that one is an A and the other is a B:

 parseAB :: MonadThrow m => ConduitM Event o m (Maybe (SumType, SumType))
 parseAB = do
   t1 <- parseAorB
   case t1 of
     Nothing      -> return Nothing
     Just (n1,c1) -> do
       t2 <- parseAorB
       case t2 of
         Nothing -> return Nothing
         Just (n2,c2) -> do
           if | "A" == n1 && "B" == n2 -> return $ Just (A c1, B c2)
              | "A" == n2 && "B" == n1 -> return $ Just (A c2, B c1)
              | otherwise              -> return Nothing

Update

You can reduce the boilerplate in parseAB by using the MaybeT transformer:

import Control.Monad.Trans.Maybe
import Control.Monad.Trans

parseAB' :: MonadThrow m => MaybeT (ConduitM Event o m) (SumType, SumType)
parseAB' = do
 (n1, c1) <- MaybeT parseAorB
 (n2, c2) <- MaybeT parseAorB
 if | "A" == n1 && "B" == n2 -> return (A c1, B c2)
    | "A" == n2 && "B" == n1 -> return (A c2, B c1)
    | otherwise              -> MaybeT $ return Nothing

And if you have several constructors, I'd consider doing something like this:

allkids = do
  kids <- many parseAorB
  let sorted = sort kids -- automatically sorts by name
  if map fst kids == [ "A", "B", "C", "D", "E", "F", "G", "H"]
    then let [ca, cb, cc, cd, ce, cf, cg, ch] = map snd kids
         in return (A ca, B cb, C cc, D cd, E ce, F cf, G cg, H ch)
    else ...error...

The many combinator comes from Tet.XML.Stream.Parse.

Upvotes: 1

Related Questions