Sal

Reputation: 4326

Parsing elements with namespace etc. in XML

This question is about parsing XML content whose elements carry namespace declarations (xmlns attributes). I wrote code that parses it correctly, and I would appreciate pointers on whether it can be done better.

I have an XML file test.xml as below:

<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body>
<SomeResponse xmlns="https://testsomestuff.org/API/WS/">
 <SomeResult>
&lt;html&gt;
    &lt;head&gt;
        &lt;title&gt;My &lt;b&gt;Title&lt;/b&gt;&lt;/title&gt;
    &lt;/head&gt;
    &lt;body&gt;
        &lt;p&gt;Foo bar baz&lt;/p&gt;
    &lt;/body&gt;
&lt;/html&gt;
 </SomeResult>
</SomeResponse>
</soap:Body></soap:Envelope>

I wrote the following code to extract and parse the "SomeResult" content using xml-conduit:

{-# LANGUAGE OverloadedStrings #-}
import Prelude hiding (readFile)
import Text.XML
import Text.XML.Cursor
import qualified Data.Text as T
import Data.Text.Lazy (fromStrict)

main :: IO ()
main = do
    doc <- readFile def "test.xml"
    let cursor = fromDocument doc
        -- Collect the escaped HTML text stored under SomeResult.
        res = fromStrict $ T.concat $
            child cursor
                >>= laxElement "Body"
                >>= child
                >>= laxElement "SomeResponse"
                >>= child
                >>= laxElement "SomeResult"
                >>= descendant
                >>= content
        -- Parse the unescaped text as a second document.
        pres = parseText_ def res
        cursor2 = fromDocument pres
        res2 = child cursor2
            >>= element "head"
            >>= child
            >>= element "title"
            >>= descendant
            >>= content
    print res2

Output in GHCi (it parses correctly):

*Main> main
["My ","Title"]

Is the laxElement approach a good way to locate the SomeResult content? If there is a better way, I would very much appreciate pointers.

Also, I need to do the escaping in the reverse direction (when building a request for the response above), where the inner body is escaped (like the content under SomeResult in test.xml). Is that taken care of by default when building the request with Text.XML, or do I have to escape the inner body explicitly with something like html-entities?

Upvotes: 1

Views: 211

Answers (1)

Pierre R

Reputation: 216

Together with xml-conduit, I would suggest using a small lens package such as xml-html-conduit-lens or xml-lens (the two are quite similar; I chose the first after a quick look at the source). Namespaces are supported (see this issue).

You can look at one of my experimental projects if you need a more concrete example. From that project, here is a traversal that gets the information for a specific machine from the VCloud API:

fetchVM :: AsXmlDocument t => Text -> Traversal' t Element
fetchVM n = xml...ovfNode "VirtualSystem".attributed (ix (nsName ovfNS "id").only n)

You can then combine traversals like this:

vmId = raw ^. responseBody . fetchVM vmName . fetchVmId.text

Look at how ovfNode and nsName are defined to see how I handle namespaces.
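Without the lens layer, the same idea can be sketched in plain xml-conduit for the test.xml from the question: match fully qualified names with element instead of ignoring the namespace with laxElement. This is only a sketch. The helpers soapName and apiName are made up for this illustration and are not from the linked project; the namespace URIs are copied from test.xml.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Prelude hiding (readFile)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import           Text.XML
import           Text.XML.Cursor

-- Hypothetical helpers: names qualified with the namespaces declared
-- in test.xml. The prefix is left empty because xml-conduit compares
-- names by local name and namespace only.
soapName :: T.Text -> Name
soapName local =
    Name local (Just "http://schemas.xmlsoap.org/soap/envelope/") Nothing

apiName :: T.Text -> Name
apiName local =
    Name local (Just "https://testsomestuff.org/API/WS/") Nothing

main :: IO ()
main = do
    doc <- readFile def "test.xml"
    let -- Walk Envelope -> Body -> SomeResponse -> SomeResult and
        -- collect the text content of SomeResult.
        escaped = T.concat $
            fromDocument doc
                $/ element (soapName "Body")
                &/ element (apiName "SomeResponse")
                &/ element (apiName "SomeResult")
                &/ content
        -- The text is itself markup, so parse it as a second document.
        inner = fromDocument (parseText_ def (TL.fromStrict escaped))
    print $ inner $/ element "head" &/ element "title" &// content
```

With OverloadedStrings, a qualified name can also be written as a literal in Clark notation, for example "{https://testsomestuff.org/API/WS/}SomeResult". Matching on the full name means an element with the same local name from a different namespace is not picked up by accident, which laxElement would allow.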

Here is another interesting article on the subject: https://www.schoolofhaskell.com/user/chad/snippets/random-code-snippets/xml-conduit-lens

Another tip is to stick with xml-conduit (at least for now). Some have suggested taggy as a replacement, but unfortunately it is not currently under active development (see https://github.com/alpmestan/taggy/issues/14).
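On the escaping part of the question: as far as I know, xml-conduit escapes text nodes when it renders a document, so a separate entity-encoding step should not be needed for the inner body. A minimal sketch, assuming the inner markup is already available as a Text value (the inner string below is a made-up placeholder, and only the SomeResult element is built rather than a full SOAP envelope):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import qualified Data.Text.Lazy.IO as TL
import           Text.XML

main :: IO ()
main = do
    let -- Placeholder inner body; in practice this could be the
        -- rendered text of another document.
        inner = "<html><body><p>Foo bar baz</p></body></html>"
        -- Put the markup in as plain text content, not as child elements.
        result = Element
            "{https://testsomestuff.org/API/WS/}SomeResult"
            Map.empty
            [NodeContent inner]
        doc = Document (Prologue [] Nothing []) result []
    -- The renderer escapes markup characters in text nodes,
    -- so each '<' of the inner body is written out as &lt;.
    TL.putStrLn (renderText def doc)
```

The key point is to insert the inner body as a NodeContent value. If it were inserted as NodeElement children instead, it would be rendered as real nested markup rather than escaped text.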

I hope it helps.

Upvotes: 1
