Reputation: 551
I'm trying to teach myself Clojure.
For a work-related project (to state the obvious, I'm not a professional programmer), I'm trying to combine a bunch of spreadsheets. The spreadsheets have comments that relate to financial transactions. Multiple comments (including across spreadsheets) can refer to the same transaction; each transaction has a unique serial number. I am therefore using the following data structure to represent the spreadsheets:
(def ss { :123 '([ "comment 1" "comment 2" ]
                 [ "comment 3" "comment 4" ]
                 [ "comment 5" ]),
          :456 '([ "happy days" "are here" ]
                 [ "again" ])})
This might be created from the following two spreadsheets:
+------------+------------+-----------+
| Trans. No. | Cmt. A     | Cmt. B    |
+------------+------------+-----------+
| 123        | comment 1  | comment 2 |
| 456        | happy days | are here  |
| 123        | comment 3  | comment 4 |
+------------+------------+-----------+
+-----------------+------------+
| Analyst Comment | Trans. No. |
+-----------------+------------+
| comment 5       | 123        |
| again           | 456        |
+-----------------+------------+
I have successfully written functions to create this data structure given a directory full of CSVs. I want to write two further functions:
;; FUNCTION 1 ==========================================================
;; Regex Spreadsheet -> Spreadsheet ; "Spreadsheet" is like ss above
;; Produces a Spreadsheet with ALL comments per transaction if ANY
;; value matches the regex
; (defn filter-all [regex my-ss] {}) ; stub
(defn filter-all [regex my-ss] ; template
  (... my-ss))
(deftest filter-all-tests
  (is (= (filter-all #"1" ss)
         { :123 '([ "comment 1" "comment 2" ]
                  [ "comment 3" "comment 4" ]
                  [ "comment 5" ]) })))
;; FUNCTION 2 ==========================================================
;; Regex Spreadsheet -> Spreadsheet ; "Spreadsheet" is like ss above
;; Produces a Spreadsheet with each transaction number that has at least
;; one comment that matches the regex, but ONLY those comments that
;; match the regex
; (defn filter-matches [regex my-ss] {}) ; stub
(defn filter-matches [regex my-ss] ; template
  (... my-ss))
(deftest filter-matches-tests
  (is (= (filter-matches #"1" ss)
         { :123 '([ "comment 1" ]) })))
What I don't understand is the best way to get the regex far enough down into the vals for each key, given that they are strings nested inside vectors nested inside lists. I have tried using filter with nested applys or maps, but I'm confusing myself with the syntax, and even if it works I don't know how to hang on to the keys in order to build up a new hashmap.
I have also tried using destructuring within the filter function, but there too I'm confusing myself, and I also think I have to "lift" the functions across the nested data (I think that's the term, as with applicatives and monads in Haskell).
Can somebody please suggest the best approach to filtering this data structure? As a separate matter, I would be glad to have feedback on whether this is a sensible data structure for my purposes, but I would like to learn how to solve this problem as it currently exists, if only for learning purposes.
Thanks much.
Upvotes: 1
Views: 1286
Reputation: 4235
I think you're sort of on the right track, but perhaps making life a little harder than it needs to be.
My main concern is your use of regular expressions. While regexps are a good tool for some things, they are often used where other solutions would be simpler and a lot faster.
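As one illustration of a non-regex alternative, here is a minimal sketch that tests for a plain substring with clojure.string/includes? (available since Clojure 1.8). The helper name any-comment-includes? is hypothetical, and the sketch assumes the values are lists of vectors of strings, as in the question.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: true if any comment in a transaction's value
;; (a list of vectors of strings) contains the plain substring `needle`.
(defn any-comment-includes? [needle comment-groups]
  (boolean (some #(str/includes? % needle) (flatten comment-groups))))

(any-comment-includes? "days" '(["happy days" "are here"] ["again"])) ;=> true
(any-comment-includes? "days" '(["comment 5"]))                       ;=> false
```

This only covers literal substrings; anything that needs real pattern matching still calls for a regex.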
One of the key ideas to adopt in Clojure is the use of small libraries which you assemble together to get a higher level of abstraction. For example, there are various libraries for handling different spreadsheet formats, such as Excel and Google Docs spreadsheets, and there is support for processing CSV files. Therefore, my first step would be to see if you can find a library which will parse your spreadsheet into a standard Clojure data structure.
For example, Clojure's data.csv will process a CSV spreadsheet into a lazy sequence of vectors, where each vector is a line from the spreadsheet and each element in the vector is a column value from that line. Once you have your data in that format, processing it with map, filter et al. is fairly trivial.
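To make that concrete, here is a minimal sketch that assumes the rows have already been parsed into vectors of strings (the shape described above). The sample rows are copied from the first spreadsheet in the question, and the names rows and rows-123 are only illustrative.

```clojure
;; Rows in the shape described above: one vector per line, one string per column.
;; The data is the first spreadsheet from the question, header row included.
(def rows
  [["Trans. No." "Cmt. A" "Cmt. B"]
   ["123" "comment 1" "comment 2"]
   ["456" "happy days" "are here"]
   ["123" "comment 3" "comment 4"]])

;; Skip the header, then keep only the rows for transaction 123.
(def rows-123
  (filter (fn [[trans-no & _]] (= trans-no "123")) (rest rows)))

;; Pull the first comment column out of each remaining row.
(map second rows-123) ;=> ("comment 1" "comment 3")
```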
The next step is to think about the type of abstraction which will make your processing as easy as possible. This will depend largely on what you plan to do, but my suggestion with this sort of data would be a nested structure of hash maps: the outer map is indexed by your transaction number, and each value is itself a hash map with an entry for each column in the spreadsheet.
{:123 {:cmnta ["comment 1" "comment 3"]
       :cmntb ["comment 2" "comment 4"]
       :analystcmt ["comment 5"]}
 :456 {:cmnta ["happy days"]
       :cmntb ["are here"]
       :analystcmt ["again"]}}
With this structure, you can then use functions like get-in and update-in to access or change the values in your structure. Note that the outer keys are keywords such as :123, so the path must use keywords too, e.g.
(get-in m [:123 :cmnta]) => ["comment 1" "comment 3"]
(get-in m [:123 :cmnta 0]) => "comment 1"
(get-in m [:456 :cmnta 1]) => nil
(get-in m [:456 :cmnta 1] "nothing to see here - move on") => "nothing to see here - move on"
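update-in is mentioned above but not shown, so here is a minimal sketch of it. It assumes m is the nested map sketched earlier (repeated so the example runs on its own); the names m2 and m3 and the transaction number :789 are made up for illustration.

```clojure
;; `m` is the nested map sketched above; the outer keys are keywords.
(def m {:123 {:cmnta ["comment 1" "comment 3"]
              :cmntb ["comment 2" "comment 4"]
              :analystcmt ["comment 5"]}
        :456 {:cmnta ["happy days"]
              :cmntb ["are here"]
              :analystcmt ["again"]}})

;; update-in returns a new map; `m` itself is unchanged.
;; (fnil conj []) starts a fresh vector when the path does not exist yet.
(def m2 (update-in m [:456 :analystcmt] (fnil conj []) "a new comment"))

(get-in m2 [:456 :analystcmt]) ;=> ["again" "a new comment"]
(get-in m  [:456 :analystcmt]) ;=> ["again"]

;; A path through a missing transaction creates the nested maps on the way.
(def m3 (update-in m [:789 :cmnta] (fnil conj []) "first comment"))

(get-in m3 [:789 :cmnta]) ;=> ["first comment"]
```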
Upvotes: 0
Reputation: 6033
Here is a solution that works with your data structure.
filter takes a predicate function, and inside that function you can reach into the data structure to test whatever you need. Here, flatten collapses the list of comment vectors into a single sequence of strings.
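(defn filter-all [regex my-ss]
  (into {}
        (filter (fn [[k v]] ; a map entry can be destructured like a vector
                  ; flatten turns the list of vectors into one sequence of strings
                  ; some returns a truthy value as soon as one comment matches
                  ; re-find matches anywhere in the string; re-matches would
                  ; require the whole string to match, so #"1" would never hit
                  (some #(re-find regex %) (flatten v)))
                my-ss)))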
user> (filter-all #".*3.*" ss)
{:123 (["comment 1" "comment 2"] ["comment 3" "comment 4"] ["comment 5"])}
For filter-matches the logic is different: you want to build a new map containing only some parts of the values. reduce can help with that:
(defn filter-matches [regex my-ss]
  (reduce (fn [m [k v]] ; m is the result map (accumulator)
            (let [matches (filter #(re-find regex %) (flatten v))]
              (if (seq matches)
                (assoc m k (vec matches))
                m))) ; no match: return the accumulator unchanged, not nil
          {}
          my-ss))
user> (filter-matches #".*days.*" ss)
{:456 ["happy days"]}
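Note that the function above returns a flat vector of matching comments per transaction, whereas the test in the question expects the nested shape (a sequence of vectors). If that shape matters, one possible variant is sketched below. The name filter-matches-nested is hypothetical, and the sketch uses re-find so that a short pattern such as #"1" matches anywhere in a comment.

```clojure
;; The sample data from the question.
(def ss {:123 '(["comment 1" "comment 2"]
                ["comment 3" "comment 4"]
                ["comment 5"])
         :456 '(["happy days" "are here"]
                ["again"])})

;; Hypothetical variant that keeps the list-of-vectors shape.
(defn filter-matches-nested [regex my-ss]
  (reduce-kv
    (fn [m k groups]
      (let [kept (->> groups
                      ;; keep only the matching strings in each vector
                      (map (fn [group] (filterv #(re-find regex %) group)))
                      ;; drop vectors that ended up empty
                      (filter seq))]
        (if (seq kept)
          (assoc m k kept)
          m))) ; no match for this key: leave the accumulator as it is
    {}
    my-ss))

(filter-matches-nested #"1" ss) ;=> {:123 (["comment 1"])}
```

Clojure compares sequential collections by content, so this result is equal to the { :123 '([ "comment 1" ]) } expected by the question's test.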
As for the data structure itself, if there is no need to keep the nested vectors inside the list for each entry, you can simplify it to {:123 ["comment 1" "comment 2"] ...}, but it won't drastically simplify the above functions.
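For comparison, here is a minimal sketch of the flat structure and a filter over it. The names flat-ss and filter-all-flat are hypothetical; the only real difference from the nested version is that the flatten call moves from the filter into a one-time conversion.

```clojure
;; The sample data from the question.
(def ss {:123 '(["comment 1" "comment 2"]
                ["comment 3" "comment 4"]
                ["comment 5"])
         :456 '(["happy days" "are here"]
                ["again"])})

;; Convert once: each value becomes a single vector of comment strings.
(def flat-ss
  (into {} (map (fn [[k groups]] [k (vec (flatten groups))])) ss))

;; Same idea as filter-all, but the predicate no longer needs flatten.
(defn filter-all-flat [regex flat]
  (into {}
        (filter (fn [[_ comments]] (some #(re-find regex %) comments)))
        flat))

(filter-all-flat #"days" flat-ss)
;=> {:456 ["happy days" "are here" "again"]}
```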
Upvotes: 2