pistacchio

Reputation: 58883

Searching xml in Clojure

I have the following sample xml:

<data>
  <products>
    <product>
      <section>Red Section</section>
      <images>
        <image>img.jpg</image>
        <image>img2.jpg</image>
      </images>
    </product>
    <product>
      <section>Blue Section</section>
      <images>
        <image>img.jpg</image>
        <image>img3.jpg</image>
      </images>
    </product>
    <product>
      <section>Green Section</section>
      <images>
        <image>img.jpg</image>
        <image>img2.jpg</image>
      </images>
    </product>
  </products>
</data>

I know how to parse it in Clojure

(require '[clojure.xml :as xml])
(def x (xml/parse "location/of/that/xml"))

This returns a nested map describing the xml

{:tag :data,
 :attrs nil,
 :content [
     {:tag :products,
      :attrs nil,
      :content [
          {:tag :product,
           :attrs nil,
           :content [] ..

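This structure can of course be traversed with standard Clojure functions, but that can turn out to be really verbose, especially compared to, for instance, querying it with XPath. Is there any helper to traverse and search such a structure? How can I, for example, get all the products, get only the products that contain an image named "img2.jpg", or get the product whose section is "Red Section"?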

Thanks

Upvotes: 13

Views: 4322

Answers (5)

Alan Thompson

Reputation: 29958

The Tupelo library can solve problems like this with its tupelo.forest tree data structure; see the library's API docs for more information.

Here we load your xml data and convert it first into enlive and then the native tree structure used by tupelo.forest. Libs & data def:

(ns tst.tupelo.forest-examples
  (:use tupelo.forest tupelo.test )
  (:require
    [clojure.data.xml :as dx]
    [clojure.java.io :as io]
    [clojure.set :as cs]
    [net.cgrand.enlive-html :as en-html]
    [schema.core :as s]
    [tupelo.core :as t]
    [tupelo.string :as ts]))
(t/refer-tupelo)

(def xml-str-prod "<data>
                    <products>
                      <product>
                        <section>Red Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img2.jpg</image>
                        </images>
                      </product>
                      <product>
                        <section>Blue Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img3.jpg</image>
                        </images>
                      </product>
                      <product>
                        <section>Green Section</section>
                        <images>
                          <image>img.jpg</image>
                          <image>img2.jpg</image>
                        </images>
                      </product>
                    </products>
                  </data> " )

and initialization code:

(dotest
  (with-forest (new-forest)
    (let [enlive-tree          (->> xml-str-prod
                                 java.io.StringReader.
                                 en-html/html-resource
                                 first)
          root-hid             (add-tree-enlive enlive-tree)
          tree-1               (hid->hiccup root-hid)

The hid suffix stands for "Hex ID", a unique hex value that acts like a pointer to a node or leaf in the tree. At this stage we have just loaded the data into the forest data structure, creating tree-1, which looks like:

[:data
 [:tupelo.forest/raw "\n                    "]
 [:products
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Red Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img2.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Blue Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img3.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                      "]
  [:product
   [:tupelo.forest/raw "\n                        "]
   [:section "Green Section"]
   [:tupelo.forest/raw "\n                        "]
   [:images
    [:tupelo.forest/raw "\n                          "]
    [:image "img.jpg"]
    [:tupelo.forest/raw "\n                          "]
    [:image "img2.jpg"]
    [:tupelo.forest/raw "\n                        "]]
   [:tupelo.forest/raw "\n                      "]]
  [:tupelo.forest/raw "\n                    "]]
 [:tupelo.forest/raw "\n                   "]]

We next remove any blank strings with this code:

blank-leaf-hid?      (fn [hid] (and (leaf-hid? hid) ; ensure it is a leaf node
                                 (let [value (hid->value hid)]
                                      (and (string? value)
                                        (or (zero? (count value)) ; empty string
                                          (ts/whitespace? value)))))) ; all whitespace string

blank-leaf-hids      (keep-if blank-leaf-hid? (all-hids))
>>                   (apply remove-hid blank-leaf-hids)
tree-2               (hid->hiccup root-hid)

to produce a much nicer result tree (hiccup format)

[:data
 [:products
  [:product
   [:section "Red Section"]
   [:images [:image "img.jpg"] [:image "img2.jpg"]]]
  [:product
   [:section "Blue Section"]
   [:images [:image "img.jpg"] [:image "img3.jpg"]]]
  [:product
   [:section "Green Section"]
   [:images [:image "img.jpg"] [:image "img2.jpg"]]]]]

The following code then computes the answers to the three questions above:

product-hids         (find-hids root-hid [:** :product])
product-trees-hiccup (mapv hid->hiccup product-hids)

img2-paths           (find-paths-leaf root-hid [:data :products :product :images :image] "img2.jpg")
img2-prod-paths      (mapv #(drop-last 2 %) img2-paths)
img2-prod-hids       (mapv last img2-prod-paths)
img2-trees-hiccup    (mapv hid->hiccup img2-prod-hids)

red-sect-paths       (find-paths-leaf root-hid [:data :products :product :section] "Red Section")
red-prod-paths       (mapv #(drop-last 1 %) red-sect-paths)
red-prod-hids        (mapv last red-prod-paths)
red-trees-hiccup     (mapv hid->hiccup red-prod-hids)]

with results:

 (is= product-trees-hiccup
   [[:product
     [:section "Red Section"]
     [:images
      [:image "img.jpg"]
      [:image "img2.jpg"]]]
    [:product
     [:section "Blue Section"]
     [:images
      [:image "img.jpg"]
      [:image "img3.jpg"]]]
    [:product
     [:section "Green Section"]
     [:images
      [:image "img.jpg"]
      [:image "img2.jpg"]]]] )

(is= img2-trees-hiccup
  [[:product
    [:section "Red Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]
   [:product
    [:section "Green Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]])

(is= red-trees-hiccup
  [[:product
    [:section "Red Section"]
    [:images
     [:image "img.jpg"]
     [:image "img2.jpg"]]]]))))

The full example can be found in the forest-examples unit test.

Upvotes: 1

Terje Sten Bjerkseth

Reputation: 683

Here's an alternative version using data.zip that covers all three use cases. I've found that xml-> and xml1-> have pretty powerful navigation built in, with sub-queries expressed as vectors.

;; [org.clojure/data.zip "0.1.1"]

(ns example.core
  (:require
   [clojure.zip :as zip]
   [clojure.xml :as xml]
   [clojure.data.zip.xml :refer [text xml-> xml1->]]))

(def data (zip/xml-zip (xml/parse "/tmp/products.xml")))

(let [all-products (xml-> data :products :product)
      red-section (xml1-> data :products :product [:section "Red Section"])
      img2 (xml-> data :products :product [:images [:image "img2.jpg"]])]
  {:all-products (map (fn [product] (xml1-> product :section text)) all-products)
   :red-section (xml1-> red-section :section text)
   :img2 (map (fn [product] (xml1-> product :section text)) img2)})

=> {:all-products ("Red Section" "Blue Section" "Green Section"),
    :red-section "Red Section",
    :img2 ("Red Section" "Green Section")}

Upvotes: 6

ponzao

Reputation: 20934

Using zippers from data.zip, here is a solution for your second use case:

(ns core
  (:use clojure.data.zip.xml)
  (:require [clojure.zip :as zip]
            [clojure.xml :as xml]))

(def data (zip/xml-zip (xml/parse PATH)))
(def products (xml-> data :products :product))

(for [product products :let [image (xml-> product :images :image)]
                       :when (some (text= "img2.jpg") image)]
  {:section (xml1-> product :section text)
   :images (map text image)})
=> ({:section "Red Section", :images ("img.jpg" "img2.jpg")}
    {:section "Green Section", :images ("img.jpg" "img2.jpg")})

Upvotes: 10

Arthur Ulfeldt

Reputation: 91554

In many cases the thread-first macro, together with Clojure's map and vector semantics, is an adequate syntax for accessing XML. Sometimes you do want something more specific to XML (like an XPath library), but often the existing language is nearly as concise without adding any dependencies.

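(require '[clojure.xml :as xml]
         '[clojure.pprint :refer [pprint]])
;; data -> products -> second product -> its section -> the section text
(pprint (-> (xml/parse "/tmp/xml")
            :content first :content second :content first :content first))
"Blue Section"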
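Positional access like this breaks as soon as the element order changes. As a rough sketch of how the three queries from the question (all products, products containing "img2.jpg", the "Red Section" product) might look with only clojure.xml and xml-seq from clojure.core: the helpers by-tag and texts are hypothetical names made up for this example, and the sample XML is inlined as a string so the snippet runs on its own.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

;; Sample document from the question, inlined so the sketch is self-contained.
(def xml-str
  (str "<data><products>"
       "<product><section>Red Section</section>"
       "<images><image>img.jpg</image><image>img2.jpg</image></images></product>"
       "<product><section>Blue Section</section>"
       "<images><image>img.jpg</image><image>img3.jpg</image></images></product>"
       "<product><section>Green Section</section>"
       "<images><image>img.jpg</image><image>img2.jpg</image></images></product>"
       "</products></data>"))

(def root
  (xml/parse (ByteArrayInputStream. (.getBytes ^String xml-str "UTF-8"))))

;; Hypothetical helper: every element at or below `node` with the given tag.
;; xml-seq walks the whole tree depth-first; text nodes have no :tag and are skipped.
(defn by-tag [node tag]
  (filter #(= tag (:tag %)) (xml-seq node)))

;; Hypothetical helper: the text of every <tag> element at or below `node`.
(defn texts [node tag]
  (map #(apply str (:content %)) (by-tag node tag)))

;; 1. All products.
(def products (by-tag root :product))
(def section-names (mapcat #(texts % :section) products))

;; 2. Sections of the products that contain an image named "img2.jpg".
(def img2-sections
  (for [p products
        :when (some #{"img2.jpg"} (texts p :image))]
    (first (texts p :section))))

;; 3. The product whose section is "Red Section".
(def red-product
  (first (filter #(= ["Red Section"] (texts % :section)) products)))
```

Searching by tag rather than by position costs a full tree walk per query, which is fine for small documents like this one; for larger ones the zipper or XPath approaches in the other answers are probably a better fit.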

Upvotes: 1

Ankur

Reputation: 33637

You can use an XPath library such as clj-xpath.
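To give an idea of what the XPath queries themselves could look like, here is a minimal sketch. Note that it does not use clj-xpath's own API (check that library's README for its functions); instead it calls the JDK's built-in javax.xml.xpath classes through Java interop, so it needs no extra dependency. The helper xpath-texts is a hypothetical name for this example, and the XML from the question is inlined as a string.

```clojure
(import '(javax.xml.parsers DocumentBuilderFactory)
        '(javax.xml.xpath XPathFactory XPathConstants)
        '(org.w3c.dom Document NodeList)
        '(org.xml.sax InputSource)
        '(java.io StringReader))

;; Sample document from the question, inlined so the sketch is self-contained.
(def xml-str
  (str "<data><products>"
       "<product><section>Red Section</section>"
       "<images><image>img.jpg</image><image>img2.jpg</image></images></product>"
       "<product><section>Blue Section</section>"
       "<images><image>img.jpg</image><image>img3.jpg</image></images></product>"
       "<product><section>Green Section</section>"
       "<images><image>img.jpg</image><image>img2.jpg</image></images></product>"
       "</products></data>"))

;; Parse the string into a W3C DOM document.
(def ^Document doc
  (-> (DocumentBuilderFactory/newInstance)
      .newDocumentBuilder
      (.parse (InputSource. (StringReader. xml-str)))))

;; Hypothetical helper: evaluate an XPath expression against `doc` and
;; return the text content of every matching node as a vector.
(defn xpath-texts [^String expr]
  (let [xpath           (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath expr doc XPathConstants/NODESET)]
    (mapv #(.getTextContent (.item nodes %))
          (range (.getLength nodes)))))

;; 1. Section names of all products.
(def all-sections
  (xpath-texts "/data/products/product/section"))

;; 2. Sections of the products that have an <image> equal to "img2.jpg".
(def img2-sections
  (xpath-texts "/data/products/product[images/image='img2.jpg']/section"))

;; 3. Images of the product whose <section> is "Red Section".
(def red-images
  (xpath-texts "/data/products/product[section='Red Section']/images/image"))
```

The predicates in square brackets do the filtering that the zipper answers express with sub-query vectors, so each use case stays a one-line query.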

Upvotes: 3
