Reputation: 6088
I'm trying to parallelize the function below. I refactored it from a for statement and used pmap to speed up reading the XML data, which went well. The next bottleneck is in my keep statement. How can I improve performance here?
I've tried (keep #(when (pmap #(later-date? (second %) after) zip) [(first %) (second %)]) zip), but nested #() functions are not allowed. I've also tried wrapping the keep, as well as the call to raw-url-data, in a future, but dereferencing either in the calling function produces nil.
(defn- raw-url-data
  "Parse xmlzip data and return a sequence of URLs/modtime vectors."
  [data after]
  (let [article (xz/xml-> data :url)
        loc (pmap #(-> (xz/xml-> % :loc xz/text) first) article)
        mod (pmap #(-> (xz/xml-> % :lastmod xz/text) first
                       parse-modtime) article)
        zip (zipmap loc mod)]
    (keep #(when (later-date? (second %) after)
             [(first %) (second %)]) zip)))
And here is my later-date? function:
(defn- later-date?
  "Return TRUE if DATETIME is after AFTER or either one is NIL."
  [datetime after]
  (or (nil? datetime)
      (nil? after)
      (time/after? datetime after)))
Upvotes: 1
Views: 188
Reputation: 91534
With this type of problem it can be tricky to make the time spent splitting the data up for parallel processing and putting it back together smaller than the time needed to process it sequentially.
In the problem above, if I'm interpreting it correctly, you are generating two sequences of data, each in parallel. These sequences can't communicate with each other during this process to see whether an entry has a later date. Only once all of the data for both sequences is finished do you form it into a map, and then you split that map back into a sequence and start processing it.
The first pair, (first loc) and (first mod), will be sitting for quite a while before they can be compared to see whether they should go into the final result, so the best speedup may come from simply removing the call to zipmap.
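As a rough sketch of what dropping zipmap could look like, the version below extracts and filters each entry in a single pass. It is self-contained, so it makes some assumptions: entries are plain maps with hypothetical :loc and :lastmod keys instead of XML zipper locations, entry->pair stands in for the two xz/xml-> lookups, and java.time.Instant replaces the question's time library.

```clojure
(import 'java.time.Instant)

(defn later-date?
  "Stand-in for the question's later-date?, using java.time.Instant
  (an assumption; the question uses time/after?)."
  [^Instant datetime ^Instant after]
  (or (nil? datetime)
      (nil? after)
      (.isAfter datetime after)))

(defn entry->pair
  "Hypothetical per-entry step: build [loc lastmod] from one entry.
  In the question this would be the :loc and :lastmod lookups."
  [entry]
  [(:loc entry)
   (some-> (:lastmod entry) Instant/parse)])

(defn recent-urls
  "Extract each pair and test it immediately, with no intermediate map."
  [entries after]
  (keep (fn [entry]
          (let [[_ modtime :as pair] (entry->pair entry)]
            (when (later-date? modtime after)
              pair)))
        entries))
```

If the extraction itself is the expensive part, mapping entry->pair with pmap and filtering the resulting pairs sequentially would keep the parallel work while still avoiding the round trip through a map.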
time/after? is very fast, so you will almost certainly lose time by calling pmap here, though it's good to know how to do it anyway. You can get around the inability of the anonymous function macro to handle nested anonymous functions by making one of them a call to fn, like so:
(keep (fn [x] (when (pmap #(later-date? (second %) after) zip) [(first x) (second x)])) zip)
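A small self-contained illustration of the fn workaround follows. Because pmap returns a lazy sequence, which is always truthy, using it directly as the test of a when does not filter anything; a variant that runs the predicate in parallel first and drops the nils afterwards may be closer to the intent. The names double-all, later?, and recent-pairs are hypothetical, and plain numbers stand in for the dates.

```clojure
;; The reader rejects nested #() literals, e.g.
;;   (map #(map #(* 2 %) %) rows)
;; Writing the outer function with fn avoids the problem.
(defn double-all
  [rows]
  (map (fn [row] (map #(* 2 %) row)) rows))

(defn later?
  "Numeric stand-in for later-date?: true if n is greater than
  after, or if either one is nil."
  [n after]
  (or (nil? n) (nil? after) (> n after)))

(defn recent-pairs
  "Evaluate the predicate for each [k v] pair in parallel with pmap,
  then remove the nils produced for rejected pairs."
  [pairs after]
  (->> pairs
       (pmap (fn [[k v]] (when (later? v after) [k v])))
       (remove nil?)))
```

pmap preserves input order, so the surviving pairs come back in the same order a sequential keep would produce.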
Another approach is to partition the data into chunks and process each chunk in parallel. Then adjust the partition size until you see a benefit over the splitting costs.
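One way to experiment with partition sizes is sketched below. The helper chunked-keep and its chunk-size parameter are hypothetical names, not part of the original code; the sketch uses only partition-all, pmap, keep, and concat from clojure.core.

```clojure
(defn chunked-keep
  "Split coll into chunks of chunk-size, run (keep f chunk) on each
  chunk in parallel, and concatenate the results in input order."
  [chunk-size f coll]
  (->> coll
       (partition-all chunk-size)
       ;; doall forces the work for a chunk inside the worker thread
       ;; instead of deferring it to whoever consumes the result.
       (pmap (fn [chunk] (doall (keep f chunk))))
       (apply concat)))
```

With the question's data this could be called as (chunked-keep 512 #(when (later-date? (second %) after) %) pairs), where 512 is only a starting guess to be tuned by measurement.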
This has been discussed here, and here
Upvotes: 2