jmargolisvt

Reputation: 6088

How to parallelize Clojure keep function?

I'm trying to parallelize the function below. I refactored this from a for statement and implemented pmap to speed up reading the xml data, which went well. The next bottleneck is in my keep statement. How can I improve performance here?

I've tried (keep #(when (pmap #(later-date? (second %) after) zip) [(first %) (second %)]) zip) but nested #() functions are not allowed. I've also tried wrapping the keep as well as the call to raw-url-data in a future but dereferencing either in the calling function produces nil.

(defn- raw-url-data
  "Parse xmlzip data and return a sequence of URLs/modtime vectors."
  [data after]
  (let [article (xz/xml-> data :url)
        loc (pmap #(-> (xz/xml-> % :loc xz/text) first) article)
        mod (pmap #(-> (xz/xml-> % :lastmod xz/text) first
               parse-modtime) article)
        zip (zipmap loc mod)]
    (keep #(when (later-date? (second %) after)
             [(first %) (second %)]) zip)))

And here is my later-date? function:

(defn- later-date?
  "Return TRUE if DATETIME is after AFTER or either one is NIL."
  [datetime after]
  (or (nil? datetime)
      (nil? after)
      (time/after? datetime after)))

Upvotes: 1

Views: 188

Answers (1)

Arthur Ulfeldt

Reputation: 91534

With this type of problem, it can be tricky to make the time spent splitting the data up for parallel processing and putting it back together less than the time it takes to process the data sequentially.

In the problem above, if I'm interpreting it correctly, you are generating two sequences of data, each in parallel, so the sequences can't communicate with each other during this process to see whether an entry has a later date. Only once all of the data for both sequences is finished do you form it into a map, and then you split that map back into a sequence and start processing it.

The first pair, (first loc) and (first mod), will sit for quite a while before it can be checked to see whether it should go into the final result, so the best speedup may come from simply removing the call to zipmap.
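
As a rough illustration of dropping zipmap, here is a minimal sketch that reads the location and the modification time of each entry in the same step and filters right away. To keep it runnable without the XML libraries, it assumes each entry is already a plain map with :loc and :lastmod keys and that dates are java.time.Instant values; entry-if-later and raw-url-data-single-pass are hypothetical names, not part of the original code. One difference to be aware of: zipmap also collapses duplicate locations and does not guarantee order, while this version keeps every entry in input order.

```clojure
(import 'java.time.Instant)

(defn later-date?
  "True if datetime is after `after`, or if either one is nil
  (same rule as in the question, but using java.time as an assumption)."
  [^Instant datetime ^Instant after]
  (or (nil? datetime)
      (nil? after)
      (.isAfter datetime after)))

(defn entry-if-later
  "Hypothetical helper: returns [loc lastmod] for one entry, or nil."
  [after {:keys [loc lastmod]}]
  (when (later-date? lastmod after)
    [loc lastmod]))

(defn raw-url-data-single-pass
  "Walks the entries once; no intermediate map is built."
  [entries after]
  (keep #(entry-if-later after %) entries))
```

Called with a cutoff of 2020-06-01 on entries dated 2020-01-01, 2020-12-01 and nil, this keeps the second and third entries. If the per-entry extraction turns out to be the expensive part, swapping `(keep f entries)` for `(keep identity (pmap f entries))` parallelizes the whole step at once.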

time/after? is very fast, so you will almost certainly lose time by calling pmap here, though it's good to know how to do it anyway. You can get around the inability of the anonymous function macro to handle nested anonymous functions by making one of them a call to fn, like so:

(keep (fn [x] (when (pmap #(later-date? (second %) after) zip) [(first x) (second x)])) zip)
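
Keep in mind that pmap returns a lazy sequence, and a sequence is always truthy in Clojure, so using a pmap call as the test of a when never filters anything out. If the goal is to run the date check itself in parallel, one possible arrangement is to let pmap apply the check to each entry and then drop the nils. The sketch below uses plain numbers as stand-ins for dates so it runs on its own; pkeep is a hypothetical helper name.

```clojure
(defn pkeep
  "Hypothetical helper: like keep, but f is applied to the elements via pmap."
  [f coll]
  (keep identity (pmap f coll)))

;; Numbers stand in for dates here; nil means "no modification time".
(def after 5)

(pkeep (fn [[loc mod]]
         (when (or (nil? mod) (> mod after))
           [loc mod]))
       [["a" 1] ["b" 9] ["c" nil]])
;; => (["b" 9] ["c" nil])
```

For a check as cheap as a date comparison, this is likely to be slower than a plain keep, because each element pays the cost of being handed to another thread.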

Another approach is to

  1. break it into partitions,
  2. do all the processing on each partition, and
  3. merge them back together.

Then adjust the partition size until you see a benefit over the splitting costs.
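
The three steps above could be sketched roughly like this, using only clojure.core. chunked-pkeep is a hypothetical name, and the partition size n is the knob to tune:

```clojure
(defn chunked-pkeep
  "Hypothetical helper: splits coll into partitions of n elements, runs
  (keep f ...) over each partition in parallel, and concatenates the results."
  [n f coll]
  (->> (partition-all n coll)                     ; 1. break into partitions
       (pmap (fn [chunk] (doall (keep f chunk)))) ; 2. process each partition
       (apply concat)))                           ; 3. merge back together

(chunked-pkeep 3 #(when (odd? %) %) (range 10))
;; => (1 3 5 7 9)
```

The doall matters: without it, each worker would hand back an unrealized lazy sequence and the real work would happen later on the consuming thread. Timing the call with a few different values of n on realistic data should show where, if anywhere, the parallel version starts to win.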

This has been discussed here, and here

Upvotes: 2
