matanox

Reputation: 13686

How to elegantly parse xml in clojure

I have this piece of code, shown below, that builds up sentences from XML. It was hacked together until it worked, and I wonder what alternative code would be more readable.

     (mapcat
        (fn [el]
           (map special-join
              (map
                  (fn [el] (map zip-xml/text (zip-xml/xml-> el :word)))
                  (zip-xml/xml-> el :sentence))))
        (zip-xml/xml-> root :document))

The above code is not very readable, given the repeated inline function definitions combined with the nested probing, but tearing them apart into standalone functions, as in this official tutorial, really doesn't make sense to me for such simple cases.

For completeness, here's the repeating XML structure that this is parsing:

<document>
  <sentence id="1">
    <word id="1.1">Foo</word>
    <word id="1.2">bar</word>
  </sentence>
</document>

Upvotes: 1

Views: 4017

Answers (2)

Alan Thompson

Reputation: 29958

I do not like the way zippers work in Clojure, and I've not looked at clojure.zip/xml-zip or clojure.data.zip.xml/xml-> (confusing that they are two separate libs!).

Instead, may I suggest you try out the tupelo.forest library? Here is an overview from the 2017 Clojure/Conj.

Below is a live solution using tupelo.forest. I added a second sentence to make it more interesting:

(dotest
  (with-forest (new-forest)
    (let [xml-str        (ts/quotes->double
                           "<document>
                              <sentence id='1'>
                                <word id='1.1'>foo</word>
                                <word id='1.2'>bar</word>
                              </sentence>
                              <sentence id='2'>
                                <word id='2.1'>beyond</word>
                                <word id='2.2'>all</word>
                                <word id='2.3'>recognition</word>
                              </sentence>
                            </document>")

          root-hid       (add-tree-xml xml-str)
          >>             (remove-whitespace-leaves)
          bush-no-blanks (hid->bush root-hid)
          sentence-hids  (find-hids root-hid [:document :sentence])
          sentences      (forv [sentence-hid sentence-hids]
                           (let [word-hids     (hid->kids sentence-hid)
                                 words         (mapv #(grab :value (hid->leaf %)) word-hids)
                                 sentence-text (str/join \space words)]
                             sentence-text))
          ]
      (is= bush-no-blanks
        [{:tag :document}
         [{:id "1", :tag :sentence}
          [{:id "1.1", :tag :word, :value "foo"}]
          [{:id "1.2", :tag :word, :value "bar"}]]
         [{:id "2", :tag :sentence}
          [{:id "2.1", :tag :word, :value "beyond"}]
          [{:id "2.2", :tag :word, :value "all"}]
          [{:id "2.3", :tag :word, :value "recognition"}]]])
      (is= sentences
        ["foo bar"
         "beyond all recognition"]))))

The idea is to find the hid (Hex ID, like a pointer) for each sentence. In the forv loop, we find the child nodes of each sentence, extract each :value, and join them into a string. The unit tests show the tree structure as parsed from XML (after deleting blank nodes) and the final result. Note that we ignore the id fields and use only the tree structure to understand the sentences.
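
The same steps can be sketched in plain Clojure over the bush-shaped data shown in the test above. This is only an illustration of the idea, not tupelo.forest's API: `bush->sentences` is a hypothetical helper, and it assumes every node is a vector of an attribute map followed by its child nodes.

```clojure
(require '[clojure.string :as str])

;; Bush-shaped data copied from the test expectation above:
;; each node is a vector of [attribute-map & child-nodes].
(def bush
  [{:tag :document}
   [{:id "1", :tag :sentence}
    [{:id "1.1", :tag :word, :value "foo"}]
    [{:id "1.2", :tag :word, :value "bar"}]]
   [{:id "2", :tag :sentence}
    [{:id "2.1", :tag :word, :value "beyond"}]
    [{:id "2.2", :tag :word, :value "all"}]
    [{:id "2.3", :tag :word, :value "recognition"}]]])

(defn bush->sentences
  "Hypothetical helper: returns one string per :sentence child of the root node."
  [[_root-attrs & sentences]]
  (vec
    (for [[attrs & words] sentences
          :when (= :sentence (:tag attrs))]
      ;; Each word node is [{... :value \"text\"}], so take :value of its first element.
      (str/join \space (map (comp :value first) words)))))

(bush->sentences bush)
;; => ["foo bar" "beyond all recognition"]
```

As in the forest version, the id attributes are never consulted; only the nesting determines which words belong to which sentence.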

Documentation for tupelo.forest is still a work in progress, but you can see many live examples here.

The Tupelo project lives on GitHub.


Update

I have been thinking about the streaming data problem, and have added a new function proc-tree-enlive-lazy to enable lazy processing of large data sets. Here is an example:

(dotest
  (let [xml-str (ts/quotes->double
                  "<document>
                     <sentence id='1'>
                       <word id='1.1'>foo</word>
                       <word id='1.2'>bar</word>
                     </sentence>
                     <sentence id='2'>
                       <word id='2.1'>beyond</word>
                       <word id='2.2'>all</word>
                       <word id='2.3'>recognition</word>
                     </sentence>
                   </document>")]
    (let [enlive-tree-lazy     (clojure.data.xml/parse (java.io.StringReader. xml-str))
          ; the handler is called once per matching subtree and returns that sentence's text
          doc-sentence-handler (fn [root-hid]
                                 (remove-whitespace-leaves)
                                 (let [sentence-hid  (only (find-hids root-hid [:document :sentence]))
                                       word-hids     (hid->kids sentence-hid)
                                       words         (mapv #(grab :value (hid->leaf %)) word-hids)
                                       sentence-text (str/join \space words)]
                                   sentence-text))
          result-sentences     (proc-tree-enlive-lazy enlive-tree-lazy
                                 [:document :sentence] doc-sentence-handler)]
      (is= result-sentences ["foo bar" "beyond all recognition"]))))

The idea is that you process successive subtrees, in this case whenever you get a subtree path of [:document :sentence]. You pass in a handler function, which will receive the root-hid of a tupelo.forest tree. The return value of the handler is then placed onto an output lazy sequence returned to the caller.
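
To make the handler pattern concrete, here is a rough, library-free sketch. It is not how proc-tree-enlive-lazy is implemented: `process-subtrees` and `sentence->text` are hypothetical names, and the tree is assumed to be in the `{:tag ... :content [...]}` shape with all text already trimmed.

```clojure
(require '[clojure.string :as str])

;; Hypothetical pre-parsed tree in {:tag ... :content [...]} form.
(def tree
  {:tag :document
   :content [{:tag :sentence
              :content [{:tag :word :content ["foo"]}
                        {:tag :word :content ["bar"]}]}
             {:tag :sentence
              :content [{:tag :word :content ["beyond"]}
                        {:tag :word :content ["all"]}
                        {:tag :word :content ["recognition"]}]}]})

(defn process-subtrees
  "Hypothetical helper: returns a lazy seq of (handler subtree) for every
   subtree whose tag path from the root equals path."
  [node path handler]
  (lazy-seq
    (when (and (map? node) (= (:tag node) (first path)))
      (if (next path)
        ;; Path not fully matched yet: descend into the children.
        (mapcat #(process-subtrees % (rest path) handler) (:content node))
        ;; Path fully matched: hand this subtree to the handler.
        [(handler node)]))))

(defn sentence->text
  "Hypothetical handler: joins the text of a sentence's word children."
  [sentence]
  (str/join \space (map #(apply str (:content %)) (:content sentence))))

(process-subtrees tree [:document :sentence] sentence->text)
;; => ("foo bar" "beyond all recognition")
```

The caller only supplies the path and the handler; the traversal itself stays generic, which is the same division of labor described above.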

Upvotes: 0

exupero

Reputation: 9426

Zippers may be overkill in this situation. clojure.xml/parse will give you a simple data structure representing the XML.

(require '[clojure.xml :as xml] '[clojure.string :as string])

(def doc
  (->
"<document>
  <sentence id=\"1\">
    <word id=\"1.1\">
      Foo
    </word>
    <word id=\"1.2\">
      bar
    </word>
  </sentence>
</document>
" .getBytes java.io.ByteArrayInputStream. xml/parse))

Then you can use xml-seq to get all the <sentence> tags and their children, gathering the children's text content, trimming whitespace, and joining with spaces.

(->> doc
  xml-seq
  (filter (comp #{:sentence} :tag))
  (map :content)
  (map #(transduce
          (comp
            (mapcat :content)
            (map string/trim)
            (interpose " "))
          str %)))
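
If you want to go straight from an XML string to the sentences, the parsing and the pipeline can be combined into one function. This is only a sketch of one way to arrange it: `xml->sentences` is a hypothetical name, and it assumes the input is UTF-8 text where every child of a <sentence> contains only text.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.string :as string])

(defn xml->sentences
  "Hypothetical helper: parses xml-str and returns one string per <sentence>."
  [xml-str]
  (let [doc (xml/parse (java.io.ByteArrayInputStream. (.getBytes xml-str "UTF-8")))]
    (for [el (xml-seq doc)
          :when (= :sentence (:tag el))]
      (->> (:content el)      ; the <word> elements
           (mapcat :content)  ; their text content
           (map string/trim)
           (string/join " ")))))

(xml->sentences
  "<document>
     <sentence id=\"1\">
       <word id=\"1.1\"> Foo </word>
       <word id=\"1.2\">bar</word>
     </sentence>
   </document>")
;; => ("Foo bar")
```

Using for with :when keeps the selection of <sentence> elements and the construction of each string in one place, at the cost of not being a reusable transducer.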

Upvotes: 5
