Reputation: 921
I'm a beginner trying to learn Haskell by working through some simple parsing problems. I have this XML file, which is a Goodreads API response.
<GoodreadsResponse>
<Request>
<authentication>true</authentication>
<key>API_KEY</key>
<method>search_search</method>
</Request>
<search>
<query>fantasy</query>
<results-start>1</results-start>
<results-end>20</results-end>
<total-results>53297</total-results>
<source>Goodreads</source>
<query-time-seconds>0.15</query-time-seconds>
<results>
<work>
<id type="integer">4640799</id>
<books_count type="integer">640</books_count>
<ratings_count type="integer">5640935</ratings_count>
<text_reviews_count type="integer">90100</text_reviews_count>
<original_publication_year type="integer">1997</original_publication_year>
<original_publication_month type="integer">6</original_publication_month>
<original_publication_day type="integer">26</original_publication_day>
<average_rating>4.46</average_rating>
<best_book type="Book">
<id type="integer">3</id>
<title>Harry Potter and the Sorcerer's Stone (Harry Potter, #1)</title>
<author>
<id type="integer">1077326</id>
<name>J.K. Rowling</name>
</author>
<image_url>https://images.gr-assets.com/books/1474154022m/3.jpg</image_url>
<small_image_url>https://images.gr-assets.com/books/1474154022s/3.jpg</small_image_url>
</best_book>
</work>
...
...
...
...
This is what I've got so far:
{-# LANGUAGE DeriveGeneric #-}
module Lib where
import Data.ByteString.Lazy (ByteString)
import Data.Text (Text)
import GHC.Generics (Generic)
import Network.HTTP.Conduit (simpleHttp)
import Text.Pretty.Simple (pPrint)
import Text.XML.Light
data GRequest = GRequest { authentication :: Text
                         , key :: Text
                         , method :: Text
                         }
  deriving (Generic, Show)
data GSearch = GSearch { query :: Text
                       , results_start :: Int
                       , results_end :: Int
                       , total_results :: Int
                       , source :: Text
                       , query_time_seconds :: Float
                       , search_results :: GResults
                       }
  deriving (Generic, Show)
data GResults = GResults { results :: [Work] }
  deriving (Generic, Show)
data Work = Work { id :: Int
                 , booksCount :: Int
                 , ratingsCount :: Int
                 , text_reviewsCount :: Int
                 , originalPublicationYear :: Int
                 , originalPublicationMonth :: Int
                 , originalPublicationDay :: Int
                 , averageRating :: Float
                 , bestBook :: Book
                 }
  deriving (Generic, Show)
data Book = Book { bID :: Int
                 , bTitle :: Text
                 , bAuthor :: Author
                 , bImageURL :: Maybe Text
                 , bSmallImageURL :: Maybe Text
                 }
  deriving (Generic, Show)
data Author = Author { authorID :: Int
                     , authorName :: Text
                     }
  deriving (Generic, Show)
data GoodreadsResponse = GoodreadsResponse { request :: GRequest
                                           , search :: GSearch
                                           }
  deriving (Generic, Show)
main :: IO ()
main = do
  x <- simpleHttp apiString :: IO ByteString -- apiString is the API URL
  let listOfElements = onlyElems $ parseXML x
      filteredElements = concatMap (findElements (simpleName "work")) listOfElements
      simpleName s = QName s Nothing Nothing
  pPrint $ filteredElements
Ultimately, what I want to do is put every <work></work> element (from <results> .. </results>) into workable Haskell types. I'm not sure how to go about that: I'm using the xml package to parse the response into its default types, but I don't know how to convert those into my custom types.
Upvotes: 0
Views: 498
Reputation: 891
The most pertinent types to pattern match on are defined in Text.XML.Light.Types. Namely, you'll want to take the [Content] result that the parseXML function from Text.XML.Light.Input returns and pattern match on each individual Content value, mostly ignoring the CRef data constructor and focusing on the Elem constructors, because those are the XML tags you care about (along with the Text constructors, which hold the plain text found inside a tag).
For example you'll want to do something like the following:
#!/usr/bin/env stack
-- stack --resolver lts-12.24 --install-ghc runghc --package xml
import Text.XML.Light
import Data.Maybe
data MyXML =
    MyXML String [MyXML] -- Nested XML elements
  | Leaf String -- Leaves in the XML doc
  | Unit
  deriving (Show)
c2type :: Content -> Maybe MyXML
c2type (Text s) = Just $ Leaf $ cdData s
c2type (CRef _) = Nothing
c2type (Elem e) = Just $ MyXML (qName $ elName e) (mapMaybe c2type (elContent e))
main :: IO ()
main = do
  dat <- readFile "input.xml"
  let xml = parseXML dat
  -- print xml
  print $ mapMaybe c2type xml
For the above code snippet, say input.xml contains the following XML:
<work>
<a>1</a>
<b>2</b>
</work>
Then running the example produces:
$ ./xml.hs
[MyXML "work" [Leaf "\n ",MyXML "a" [Leaf "1"],Leaf "\n ",MyXML "b" [Leaf "2"],Leaf "\n"],Leaf "\n"]
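The whitespace-only Leaf values in that output come from the line breaks and indentation between tags. If they get in the way, one option is a variant of c2type that drops them. The following is only a sketch: the name c2typeTrim is made up for this illustration, the unused Unit constructor is left out, and the input is inlined as a string instead of being read from input.xml.

```haskell
import Data.Char (isSpace)
import Data.Maybe (mapMaybe)
import Text.XML.Light

data MyXML
  = MyXML String [MyXML] -- Nested XML elements
  | Leaf String          -- Leaves in the XML doc
  deriving (Show, Eq)

-- Hypothetical variant of c2type: same traversal, but text nodes that
-- consist only of whitespace are discarded instead of becoming leaves.
c2typeTrim :: Content -> Maybe MyXML
c2typeTrim (Text s)
  | all isSpace (cdData s) = Nothing
  | otherwise              = Just (Leaf (cdData s))
c2typeTrim (CRef _) = Nothing
c2typeTrim (Elem e) =
  Just (MyXML (qName (elName e)) (mapMaybe c2typeTrim (elContent e)))

main :: IO ()
main = print (mapMaybe c2typeTrim (parseXML "<work>\n <a>1</a>\n <b>2</b>\n</work>\n"))
```

For this input it should print [MyXML "work" [MyXML "a" [Leaf "1"],MyXML "b" [Leaf "2"]]]. Note that this also throws away text that is meaningfully blank, which may or may not matter for your data.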
The functions you'll probably find most useful for your more extensive use case include:
(qName . elName) -- Get the name of a tag as a String from an Element
elContent        -- Get an Element's immediate children as [Content]; recurse over them yourself
elAttribs        -- The Element's attributes, e.g. to check the 'type' attribute on some of your tags
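To go from there to your own record types, one possible approach is to combine findChild, strContent and unqual from the same xml package with readMaybe, and let the Maybe monad handle missing or malformed fields. This is a rough sketch, not a full mapping of your types: WorkSummary, childText, childRead and parseWork are names invented for this example, the record covers only four fields, and the sample string is a trimmed-down version of your response.

```haskell
import Text.Read (readMaybe)
import Text.XML.Light

-- Hypothetical, cut-down record; the real Work type has more fields.
data WorkSummary = WorkSummary
  { wsId     :: Int
  , wsRating :: Double
  , wsTitle  :: String
  , wsAuthor :: String
  } deriving (Show, Eq)

-- Text of the first direct child element with the given unqualified name.
childText :: String -> Element -> Maybe String
childText name el = strContent <$> findChild (unqual name) el

-- Like childText, but additionally parses the text via its Read instance.
childRead :: Read a => String -> Element -> Maybe a
childRead name el = childText name el >>= readMaybe

-- Build a WorkSummary from a <work> element.
-- Yields Nothing if any field is missing or fails to parse.
parseWork :: Element -> Maybe WorkSummary
parseWork w = do
  book   <- findChild (unqual "best_book") w
  author <- findChild (unqual "author") book
  WorkSummary
    <$> childRead "id" w
    <*> childRead "average_rating" w
    <*> childText "title" book
    <*> childText "name" author

-- Trimmed-down sample based on the response in the question.
sample :: String
sample = concat
  [ "<results><work>"
  , "<id type=\"integer\">4640799</id>"
  , "<average_rating>4.46</average_rating>"
  , "<best_book type=\"Book\"><id type=\"integer\">3</id>"
  , "<title>Harry Potter and the Sorcerer's Stone (Harry Potter, #1)</title>"
  , "<author><id type=\"integer\">1077326</id><name>J.K. Rowling</name></author>"
  , "</best_book></work></results>"
  ]

main :: IO ()
main = case parseXMLDoc sample of
  Nothing   -> putStrLn "not well-formed XML"
  Just root -> mapM_ (print . parseWork) (findElements (unqual "work") root)
```

Because findChild only looks at direct children, the id read from <work> is the work's own id and not the one nested inside <best_book>. If you need to know which field failed, swapping Maybe for Either String would be a natural next step.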
To see the general structure of the data types that the XML parser returns for your input, I strongly recommend uncommenting the print xml line in the code example above and inspecting the list of Content values it prints on the command line. That alone should tell you exactly which fields you care about. For example, this is what you get for my more minimal XML input:
[Elem (Element {elName = QName {qName = "work", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "\n ", cdLine = Just 1}),Elem (Element {elName = QName {qName = "a", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "1", cdLine = Just 2})], elLine = Just 2}),Text (CData {cdVerbatim = CDataText, cdData = "\n ", cdLine = Just 2}),Elem (Element {elName = QName {qName = "b", qURI = Nothing, qPrefix = Nothing}, elAttribs = [], elContent = [Text (CData {cdVerbatim = CDataText, cdData = "2", cdLine = Just 3})], elLine = Just 3}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 3})], elLine = Just 1}),Text (CData {cdVerbatim = CDataText, cdData = "\n", cdLine = Just 4})]
Upvotes: 1