Filtering a hashmap by value in Clojure when values have compound data

Question

I'm trying to teach myself Clojure.

For a work-related project (to state the obvious, I'm not a professional programmer), I'm trying to combine a bunch of spreadsheets. The spreadsheets have comments that relate to financial transactions. Multiple comments (including across spreadsheets) can refer to the same transaction; each transaction has a unique serial number. I am therefore using the following data structure to represent the spreadsheets:

(def ss { :123 '([ "comment 1" "comment 2" ]
                 [ "comment 3" "comment 4" ]
                 [ "comment 5" ]),
          :456 '([ "happy days" "are here" ]
                 [ "again" ])})

This might be created from the following two spreadsheets:

+------------+------------+-----------+
| Trans. No. |   Cmt. A   |  Cmt. B   |
+------------+------------+-----------+
|        123 | comment 1  | comment 2 |
|        456 | happy days | are here  |
|        123 | comment 3  | comment 4 |
+------------+------------+-----------+

+-----------------+------------+
| Analyst Comment | Trans. No. |
+-----------------+------------+
| comment 5       |        123 |
| again           |        456 |
+-----------------+------------+

I have successfully written functions to create this data structure given a directory full of CSVs. I want to write two further functions:

;; FUNCTION 1 ==========================================================
;; Regex Spreadsheet -> Spreadsheet     ; "Spreadsheet" is like ss above 
;; Produces a Spreadsheet with ALL comments per transaction if ANY
;;     value matches the regex

; (defn filter-all [regex my-ss]     {}) ; stub

(defn filter-all [regex my-ss]           ; template
  (... my-ss))

(deftest filter-all-tests
  (is (= (filter-all #"1" ss) 
         { :123 '([ "comment 1" "comment 2" ]
                  [ "comment 3" "comment 4" ]
                  [ "comment 5" ]) })))

;; FUNCTION 2 ==========================================================
;; Regex Spreadsheet -> Spreadsheet     ; "Spreadsheet" is like ss above 
;; Produces a Spreadsheet with each transaction number that has at least
;;     one comment that matches the regex, but ONLY those comments that 
;;     match the regex

; (defn filter-matches [regex my-ss] {}) ; stub

(defn filter-matches [regex my-ss]       ; template
  (... my-ss))

(deftest filter-matches-tests
  (is (= (filter-matches #"1" ss) 
         { :123 '([ "comment 1" ]) })))

What I don't understand is the best way to get the regex far enough down into the vals for each key, given that they are strings nested inside vectors nested inside lists. I have tried using filter with nested applys or maps, but I'm confusing myself with the syntax and even if it works I don't know how to hang on to the keys in order to build up a new hashmap.

I have also tried using destructuring within the filter function, but there too I'm confusing myself and I also think I have to "lift" the functions across the nested data (I think that's the term—like applicatives and monads in Haskell).

Can somebody please suggest the best approach to filtering this data structure? As a separate matter, I would be glad to have feedback on whether this is a sensible data structure for my purposes, but I would like to learn how to solve this problem as it currently exists, if only for learning purposes.

Thanks much.

T.Gounelle · Accepted Answer

Here is a solution that works with your data structure. filter takes a predicate function, and inside that predicate you can reach into each value to test whatever you need. Here, flatten collapses the list of comment vectors into a single sequence of strings.

(defn filter-all [regex my-ss]
  (into {} (filter (fn [[k v]] ; a map entry destructures like a [key value] vector
                     ; flatten turns the list of vectors into one sequence of strings
                     ; some returns the first truthy result (a match), or nil if there is none
                     ; re-find matches anywhere in the string, so a bare #"1" also works
                     (some #(re-find regex %) (flatten v)))
                   my-ss)))

user> (filter-all #".*3.*" ss)
{:123 (["comment 1" "comment 2"] ["comment 3" "comment 4"] ["comment 5"])}
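One detail worth checking with a regex predicate like this: re-matches succeeds only when the entire string matches the pattern, while re-find succeeds when the pattern occurs anywhere in the string. A short sketch of the difference, using a sample string from the question's data:

```clojure
;; re-matches: the whole string must match the pattern
(re-matches #"1" "comment 1")      ; => nil
(re-matches #".*1.*" "comment 1")  ; => "comment 1"

;; re-find: the pattern may match anywhere in the string
(re-find #"1" "comment 1")         ; => "1"
```

So a bare pattern such as #"1" from the question's tests needs re-find, or has to be widened to #".*1.*" when used with re-matches.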

For filter-matches the logic is different: you want to build a new map that keeps only some parts of the values. reduce can help with that:

(defn filter-matches [regex my-ss]
  (reduce (fn [m [k v]]   ; m is the result map (accumulator)
            (let [matches (filter #(re-find regex %) (flatten v))]
              (if (seq matches)
                (assoc m k (vec matches))
                m)))      ; no match: return the accumulator unchanged, not nil
          {}
          my-ss))

user> (filter-matches #".*days.*" ss)
{:456 ["happy days"]}
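Note that filter-matches above returns the matching comments as one flat vector per key, whereas the test in the question expects them to stay grouped in their original row vectors. If that grouping matters, a variant along the following lines might do. This is only a sketch: filter-matches-nested is a hypothetical name, and it assumes re-find semantics (a match anywhere in the comment).

```clojure
(def ss {:123 '(["comment 1" "comment 2"]
                ["comment 3" "comment 4"]
                ["comment 5"])
         :456 '(["happy days" "are here"]
                ["again"])})

;; Hypothetical variant: keeps matching comments inside their original rows
;; and drops rows (and keys) that end up empty.
(defn filter-matches-nested [regex my-ss]
  (reduce-kv
    (fn [m k rows]
      (let [kept (->> rows
                      ;; keep only the matching strings within each row
                      (map (fn [row] (filterv #(re-find regex %) row)))
                      ;; discard rows where nothing matched
                      (remove empty?))]
        (if (seq kept)
          (assoc m k kept)
          m)))   ; nothing matched for this key: leave the accumulator as is
    {}
    my-ss))

(filter-matches-nested #"1" ss)
;; => {:123 (["comment 1"])}
```

reduce-kv hands the key and value to the reducing function directly, so no destructuring of the map entry is needed.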

As for the data structure itself, if you have no need to keep the comments grouped in nested vectors inside a list for each entry, you can simplify it to {:123 ["comment 1" "comment 2"] ...}, but that won't drastically simplify the functions above.
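As a rough sketch of what that flattened layout could look like: flat-ss, filter-all-flat and filter-matches-flat are hypothetical names, and re-find is assumed so that the pattern may match anywhere in a comment. The only real change is that flatten disappears.

```clojure
;; Hypothetical flat layout: one vector of comments per transaction
(def flat-ss {:123 ["comment 1" "comment 2" "comment 3" "comment 4" "comment 5"]
              :456 ["happy days" "are here" "again"]})

;; Keep every comment of a transaction if any of them matches
(defn filter-all-flat [regex m]
  (into {}
        (filter (fn [[_ comments]] (some #(re-find regex %) comments)))
        m))

;; Keep only the matching comments, and only the keys that have some
(defn filter-matches-flat [regex m]
  (into {}
        (keep (fn [[k comments]]
                (let [hits (filterv #(re-find regex %) comments)]
                  (when (seq hits) [k hits]))))
        m))

(filter-all-flat #"happy" flat-ss)
;; => {:456 ["happy days" "are here" "again"]}

(filter-matches-flat #"days" flat-ss)
;; => {:456 ["happy days"]}
```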
