Reputation: 109
I'm fairly new to Clojure and I have some code that I am trying to optimise. I want to compute co-occurrence counts. The main function is compute-space and the output is a nested map of the form
{"w1" {"w11" 10, "w12" 31, ...}
"w2" {"w21" 14, "w22" 1, ...}
...
}
meaning that "w1" cooccurs with "w11" 10 times, etc...
It takes a coll of documents (sentences) and a coll of target words; it iterates over both and finally applies the context-fn, such as sliding-window, to extract context words. More concretely, I am passing a closure over sliding-window:
(compute-space docs (fn [target doc] (sliding-window target doc 5)) targets)
I've been testing it with around 50 million words (~3 million sentences) and ca. 20,000 targets. This version would take more than a day to complete. I also wrote a pmap parallel function (pcompute-space) that reduces computing time to around 10 hours, but I still feel it should be faster. I don't have other code to compare against; it's just my intuition that it should be faster.
(defn compute-space
  ([docs context-fn targets]
   (let [space (atom {})]
     (doseq [doc docs
             target targets]
       (when-let [contexts (context-fn target doc)]
         (doseq [w contexts]
           ;; bump the count for [target w], creating the entry if absent
           (if (get-in @space [target w])
             (swap! space update-in [target w] inc)
             (swap! space assoc-in [target w] 1)))))
     @space)))
(defn sliding-window
  [target s n]
  (loop [todo s, seen [], acc []]
    (let [curr (first todo)]
      (cond
        ;; on a target hit, append up to n words on either side to acc
        (= curr target) (recur (rest todo)
                               (cons curr seen)
                               (concat acc (take n seen) (take n (rest todo))))
        (empty? todo) acc
        :else (recur (rest todo) (cons curr seen) acc)))))
(defn pcompute-space
  [docs step context-fn targets]
  (reduce
    #(deep-merge-with + %1 %2)
    (pmap
      (fn [chunk]
        (tick) ; progress indicator
        (compute-space chunk context-fn targets))
      (partition-all step docs))))
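For reference, deep-merge-with and tick are defined in the linked gist; tick is just a progress indicator, and deep-merge-with is presumably along the lines of the old clojure.contrib version - a sketch, not necessarily the gist's exact definition:

;; Hypothetical recursive merge in the style of clojure.contrib:
;; merges nested maps, combining leaf values with f.
(defn deep-merge-with [f & maps]
  (apply merge-with
         (fn [v1 v2]
           (if (and (map? v1) (map? v2))
             (deep-merge-with f v1 v2)
             (f v1 v2)))
         maps))

(deep-merge-with + {"w1" {"a" 1}} {"w1" {"a" 2, "b" 1}})
;=> {"w1" {"a" 3, "b" 1}}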
I profiled the application with jvisualvm and found that clojure.lang.Cons, clojure.lang.ChunkedCons and clojure.lang.ArrayChunk dominate the process quite excessively (see picture). This surely has to do with the double doseq loop (previous experiments showed that this approach was faster than using map, reduce and the like, although I was using time to benchmark the functions). I'd be very thankful for any insights you could provide, and for suggestions on refactoring the code to make it run faster. I guess reducers could be of some help here, but I'm not sure how and/or why.
SPECS
MacPro 2010, 2.4 GHz Intel Core 2 Duo, 4 GB RAM
Clojure 1.6.0
Java 1.7.0_51 Java HotSpot(TM) 64-Bit Server VM
GithubGist with the entire code
Upvotes: 4
Views: 805
Reputation: 13473
Analysis of the compute-space algorithm in the question
The cost has two components: the cost of scanning the sentences looking for targets, and the cost of dealing with the targets found.
Major improvement
The context-fn scans a sentence looking for a target. If there are ten thousand targets, it scans the sentence ten thousand times.
Far better to scan the sentence once, looking for all the targets: if we keep the targets as a (hash) set, we can test whether a word is a target in more or less constant time, no matter how many targets there are.
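Since a Clojure set is a function of its elements, the target predicate can simply be the set itself - a minimal sketch with toy targets:

;; A set is its own membership predicate: lookup is effectively O(1).
(def target? #{"w1" "w2"})

(target? "w2")  ;=> "w2" (truthy)
(target? "dog") ;=> nil  (falsey)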
Possible improvement
The sliding-window function in the question generates contexts by passing each word from hand to hand - from todo to seen. It is probably quicker to pour the words into a vector, then return the contexts as subvecs.
However it is done, a simple way to organise generating the contexts is to have the context-fn return the sequence of contexts corresponding to the sequence of words. A function that does this for sliding windows is
(defn sliding-windows [w s]
  (let [v (vec s), n (count v)
        window (fn [i] (lazy-cat (subvec v (max (- i w) 0) i)
                                 (subvec v (inc i) (min (inc (+ i w)) n))))]
    (map window (range n))))
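For example (toy data), with a window of 2 each word gets up to two words on either side as its context:

(sliding-windows 2 [:a :b :c :d :e])
;=> ((:b :c) (:a :c :d) (:a :b :d :e) (:b :c :e) (:c :d))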
We can now define the compute-space function in terms of the new kind of contexts-fn as follows:
(defn compute-space [docs contexts-fn target?]
  (letfn [(stuff [s] (->> (map vector s (contexts-fn s))
                          (filter (comp target? first))))]
    (reduce
      (fn [a [k m]] (assoc a k (merge-with + (a k) (frequencies m))))
      {}
      (mapcat stuff docs))))
The code pivots on stuff, which presents a document as a sequence of [target context-sequence] pairs - one pair per occurrence of a target.
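A toy run (my example data, not the answer's) showing the shape of the result:

(compute-space [["w1" "a" "w1" "b"]]
               (partial sliding-windows 1)
               #{"w1"})
;=> {"w1" {"a" 2, "b" 1}}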
Results
This algorithm is about 500 times faster than that in the question: what the code in the question accomplishes in a day and a half, this should perform in about a minute.
Given a sentence of 100,000 words, this code constructs the context map in 100 msec.
For a sentence a tenth as long - 10,000 words - the code in the question takes 5 sec.
This is using (long) integers rather than strings as "words", so the work of hashing the strings will dilute the improvement somewhat.
Note
I took the original version of this answer down because, with accurate tests conducted by Criterium, the version that uses transient maps turned out to be a little slower, so it has been omitted.
Upvotes: 1
Reputation: 18410
Test data was quite a bit smaller than your workload. Criterium was used to collect timings: it evaluates an expression multiple times, first warming up the JIT and then collecting average data.
With my test data and your code, compute-space took 22 seconds:
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 21.989189 sec
Execution time std-deviation : 471.199127 ms
Execution time lower quantile : 21.540155 sec ( 2.5%)
Execution time upper quantile : 23.226352 sec (97.5%)
Overhead used : 13.353852 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 9.4329 % Variance is slightly inflated by outliers
First optimization
Updated to use frequencies to go from a vector of words to a map of words and counts of their occurrences.
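For reference, frequencies does exactly that step:

(frequencies ["the" "cat" "the"])
;=> {"the" 2, "cat" 1}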
To help me understand the structure of the computation I wrote a separate function that takes a collection of documents, a context-fn and a single target, and returns the map of context words to counts - the inner map for one target of what compute-space returns. I wrote it using built-in Clojure functions instead of updating counts.
(defn compute-context-map-f [documents context-fn target]
  (frequencies (mapcat #(context-fn target %) documents)))
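On toy data (mine, not the benchmark's), with the question's sliding-window and a window of 1:

(compute-context-map-f [["w1" "a" "w1" "b"]]
                       (fn [target doc] (sliding-window target doc 1))
                       "w1")
;=> {"a" 2, "b" 1}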
With compute-context-map-f written, compute-space - named compute-space-f here - is fairly short:
(defn compute-space-f [docs context-fn targets]
  (into {} (map #(vector % (compute-context-map-f docs context-fn %)) targets)))
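With the same toy data this produces the nested shape the question asks for:

(compute-space-f [["w1" "a" "w1" "b"]]
                 (fn [target doc] (sliding-window target doc 1))
                 ["w1"])
;=> {"w1" {"a" 2, "b" 1}}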
Timing, with the same data as above, is 65% of the original version:
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 14.274344 sec
Execution time std-deviation : 345.240183 ms
Execution time lower quantile : 13.981537 sec ( 2.5%)
Execution time upper quantile : 15.088521 sec (97.5%)
Overhead used : 13.353852 ns
Found 3 outliers in 60 samples (5.0000 %)
low-severe 1 (1.6667 %)
low-mild 2 (3.3333 %)
Variance from outliers : 12.5419 % Variance is moderately inflated by outliers
Parallelize first optimization
I chose to chunk by target instead of by document, so that merging the results would not require modifying the {context-word count, ...} maps for a target.
(defn pcompute-space-f [docs step context-fn targets]
  (into {} (pmap #(compute-space-f docs context-fn %) (partition-all step targets))))
Timing, with the same data as above, is 16% of the original version:
user> (criterium.core/bench (pcompute-space-f documents 4 #(sliding-window %1 %2 5) keywords))
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
Evaluation count : 60 in 60 samples of 1 calls.
Execution time mean : 3.623018 sec
Execution time std-deviation : 83.780996 ms
Execution time lower quantile : 3.486419 sec ( 2.5%)
Execution time upper quantile : 3.788714 sec (97.5%)
Overhead used : 13.353852 ns
Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
Variance from outliers : 11.0038 % Variance is moderately inflated by outliers
Specifications
TBD
To do: further optimizations; describe the test data.
Upvotes: 4