Haskell - How to parse XML response into Haskell datatypes?

Question

I'm a beginner trying to learn Haskell by working through some simple parsing problems. I have this XML file, which is a response from the Goodreads API.


    
        true
        API_KEY
        search_search
    
    
        fantasy
        1
        20
        53297
        Goodreads
        0.15
        
            
                4640799
                640
                5640935
                90100
                1997
                6
                26
                4.46
                
                    3
                    Harry Potter and the Sorcerer's Stone (Harry Potter, #1)
                    
                        1077326
                        J.K. Rowling
                    
                    https://images.gr-assets.com/books/1474154022m/3.jpg
                    https://images.gr-assets.com/books/1474154022s/3.jpg
                
            
              ...
              ...
              ...
              ...

This is what I've got so far:

{-# LANGUAGE DeriveGeneric #-}

module Lib where

import           Data.ByteString.Lazy (ByteString)
import           Data.Text            (Text)
import           GHC.Generics         (Generic)
import           Network.HTTP.Conduit (simpleHttp)
import           Text.Pretty.Simple   (pPrint)
import           Text.XML.Light

data GRequest = GRequest { authentication :: Text
                         , key            :: Text
                         , method         :: Text
                         }
              deriving (Generic, Show)

data GSearch = GSearch { query              :: Text
                       , results_start      :: Int
                       , results_end        :: Int
                       , total_results      :: Int
                       , source             :: Text
                       , query_time_seconds :: Float
                       , search_results     :: GResults
                       }
             deriving (Generic, Show)

data GResults = GResults { results :: [Work] }
              deriving (Generic, Show)


data Work = Work { id                       :: Int
                 , booksCount               :: Int
                 , ratingsCount             :: Int
                 , text_reviewsCount        :: Int
                 , originalPublicationYear  :: Int
                 , originalPublicationMonth :: Int
                 , originalPublicationDay   :: Int
                 , averageRating            :: Float
                 , bestBook                 :: Book
                 }
            deriving (Generic, Show)

data Book = Book { bID            :: Int
                 , bTitle         :: Text
                 , bAuthor        :: Author
                 , bImageURL      :: Maybe Text
                 , bSmallImageURL :: Maybe Text
                 }
            deriving (Generic, Show)


data Author = Author { authorID   :: Int
                     , authorName :: Text
                     }
              deriving (Generic, Show)


data GoodreadsResponse = GoodreadsResponse { request :: GRequest
                                           , search  :: GSearch
                                           }
                         deriving (Generic, Show)



main :: IO ()
main = do
  x <- simpleHttp apiString :: IO ByteString -- apiString is the API URL
  let listOfElements = onlyElems $ parseXML x
      filteredElements = concatMap (findElements (simpleName "work")) listOfElements
      simpleName s = QName s Nothing Nothing
  pPrint $ filteredElements

Ultimately, what I want to do is put every aspect of the response into workable Haskell types.

I'm not sure how to go about that. I'm using the xml package to parse the response into its default types, but I don't know how to convert those into my custom types.

cronburg · Accepted Answer

The types you'll want to pattern match on are defined in Text.XML.Light.Types. The parseXML function from Text.XML.Light.Input returns a [Content], and you'll want to pattern match on each individual Content value. You can mostly ignore the CRef data constructor and focus on Elem, because those values are the XML tags you care about, along with the Text constructor, which holds the plain strings found inside an XML tag.

For example, you'll want to do something like the following:

#!/usr/bin/env stack
-- stack --resolver lts-12.24 --install-ghc runghc --package xml
import Text.XML.Light
import Data.Maybe

data MyXML =
    MyXML String [MyXML] -- Nested XML elements
  | Leaf  String         -- Leaves in the XML doc
  | Unit
  deriving (Show)

c2type :: Content -> Maybe MyXML
c2type (Text s) = Just $ Leaf $ cdData s
c2type (CRef _) = Nothing
c2type (Elem e) = Just $ MyXML (qName $ elName e) (mapMaybe c2type (elContent e))

main :: IO ()
main = do
  dat <- readFile "input.xml"
  let xml = parseXML dat
--  print xml
  print $ mapMaybe c2type xml

For the above code snippet, say input.xml contains the following XML:


<work>
  <a>1</a>
  <b>2</b>
</work>

Then running the example produces:

$ ./xml.hs 
[MyXML "work" [Leaf "\n  ",MyXML "a" [Leaf "1"],Leaf "\n  ",MyXML "b" [Leaf "2"],Leaf "\n"],Leaf "\n"]

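Note that the whitespace between tags shows up as its own Leaf values. If those are not wanted, one option is to drop whitespace-only text nodes during the conversion. The following is a minimal sketch of that idea; c2typeTrimmed and parseTrimmed are hypothetical names, and the unused Unit constructor is left out here for brevity.

```haskell
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)
import Text.XML.Light

data MyXML
  = MyXML String [MyXML] -- Nested XML elements
  | Leaf String          -- Leaves in the XML doc
  deriving (Show, Eq)

-- Like c2type, but text nodes that contain only whitespace are dropped.
c2typeTrimmed :: Content -> Maybe MyXML
c2typeTrimmed (Text s)
  | all isSpace (cdData s) = Nothing
  | otherwise              = Just (Leaf (cdData s))
c2typeTrimmed (CRef _) = Nothing
c2typeTrimmed (Elem e) =
  Just (MyXML (qName (elName e)) (mapMaybe c2typeTrimmed (elContent e)))

-- Parse a whole document and keep only the non-whitespace structure.
parseTrimmed :: String -> [MyXML]
parseTrimmed = mapMaybe c2typeTrimmed . parseXML
```

With this variant, parseTrimmed "<work>\n  <a>1</a>\n  <b>2</b>\n</work>\n" evaluates to [MyXML "work" [MyXML "a" [Leaf "1"],MyXML "b" [Leaf "2"]]].
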
The functions you'll probably find most useful for your more extensive use case include:

(qName . elName) -- Get the name of a tag as a String from an Element
elContent -- Get the child Contents of an Element (recurse into them to walk the tree)
elAttribs -- Get the attributes of an Element, e.g. to check those 'type' attributes on some of your tags
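
As a rough illustration of how these three accessors fit together, here is a small sketch. The helper names childTagNames and typeAttr are made up for this example, and the 'type' attribute is only assumed to appear on some tags, as mentioned above.

```haskell
import Text.XML.Light

-- Names of the direct child elements of an element, in document order.
-- Text and CRef children are skipped by the pattern in the comprehension.
childTagNames :: Element -> [String]
childTagNames e = [qName (elName c) | Elem c <- elContent e]

-- Value of the unqualified 'type' attribute of an element, if it has one.
typeAttr :: Element -> Maybe String
typeAttr e = lookup "type" [(qName (attrKey a), attrVal a) | a <- elAttribs e]
```

The library also provides findAttr, which does the attribute lookup by QName directly; the manual version is shown only to make the role of elAttribs visible.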

To see the general structure of the values the XML parser returns for your input, I strongly recommend uncommenting the print xml line in the code example above and inspecting the list of Content values it prints on the command line. That alone should tell you exactly which fields you care about. For example, this is what you get for my minimal XML input:

[Elem (Element {elName = QName {qName = "work", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n  ", cdLine = Just 1}),Elem (Element {elName = QName {qName = "a", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "1", cdLine = Just 2})], elLine = Just 2}),Text (CData {cdVerbatim = CDataText, cdData = "\n  ", cdLine = Just 2}),Elem (Element {elName = QName {qName = "b", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "2", cdLine = Just 3})], elLine = Just 3}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 3})], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 4})]

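Once the dump shows which tags matter, one way to get from Element values to a custom record is to look up each child by name and convert its text. The sketch below does this for an Author-like record. The tag names "author", "id" and "name" are assumptions about the response layout, childText, parseAuthor and authorsIn are hypothetical helpers, and String is used instead of Text to keep the example dependent on the xml package only.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)
import Text.XML.Light

data Author = Author
  { authorID   :: Int
  , authorName :: String
  } deriving (Show, Eq)

-- Text of the first direct child with the given unqualified tag name.
childText :: String -> Element -> Maybe String
childText tag e = strContent <$> findChild (unqual tag) e

-- Build an Author from an element.
-- Yields Nothing if a child is missing or the id is not an integer.
parseAuthor :: Element -> Maybe Author
parseAuthor e =
  Author <$> (childText "id" e >>= readMaybe)
         <*> childText "name" e

-- Find every <author> element in the document and convert the ones that fit.
-- The tag name "author" is an assumption about the input.
authorsIn :: String -> [Author]
authorsIn src =
  mapMaybe parseAuthor
    (concatMap (findElements (unqual "author")) (onlyElems (parseXML src)))
```

The same pattern should extend to the larger records: write one small parser per record, and have the parser for the outer record call the inner one on the element returned by findChild. Returning Maybe keeps a malformed entry from crashing the whole parse, at the cost of not reporting which field was wrong; Either String would be a natural next step if that matters.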