Reputation: 45806
I'm writing code that takes some lazy results produced by pmap
, and draws them onto a BufferedImage
. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about a 1/3 of the way through.
I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.
This is the best MCVE that I've come up with:
(ns mandelbrot-redo.irrelevant.write-image-mcve
(:import [java.awt.image BufferedImage]
(java.util.concurrent Executors Executor)))
(defn lazy-producer [width height]
(for [y (range height)
x (range width)]
[x y (+ x y)]))
; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i 1e6))
(println (str (double (/ i (* width height) 0.01)) "%")))
((fn boop [x] x) result)))) ; Data gets consumed here
; This gets to ~30%, then begins being interupted by 1-4 second lags
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(fn []
(sync-consumer results width height)
(println "Done...")))
(.start)))
(defn -main []
(let [width 5000
height (int (* width 2/3))]
(-> (lazy-producer width height)
(async-consumer width height))))
When -main
is run with sync-consumer
, it finishes after a few seconds. With async-consumer
however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.
If I use an explicit Thread.
, or use a local thread pool in async-consumer
, it hangs and crashes. If I use future
however, it finishes fine, just like sync-consumer
.
The only hint I've gotten is that when I run this in VisualVM, I see that I have runaway allocation of Long
s when using the async version:
The sync version shows a peak amount of Long
s to be about 45mb at once in comparison.
The CPU usage is quite different too:
There's massive GC spikes, but it doesn't seem like the Long
s are being disposed of.
I could use future
for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.
Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?
Can anyone explain this behavior?
Upvotes: 4
Views: 228
Reputation: 45806
Thanks to @amalloy and @Alex, I got it working.
I implemented the suggestions by @amalloy in the comments, and both variants work here and in my real case:
; Brittle since "once" is apparently more of an implementation detail of the language
(defn async-consumer [results width height]
(doto
(Thread. ^Runnable
(^:once fn* []
(sync-consumer results width height)
(println "Done...")))
(.start)))
; Arguably less brittle under the assumption that if they replace "once" with another mechanism,
; they'll update "delay".
(defn async-consumer [results width height]
(let [d (delay (sync-consumer results width height))]
(doto
(Thread. ^Runnable
(fn []
@d
(println "Done...")))
(.start))))
I also tried updating to 1.9.0. I thought that might fix it since @Alex says he's on 1.9.0 and can't reproduce this, and there's also this bug fix that seems related. Unfortunately, I didn't notice any difference.
It would be nice if there was an actual, solid mechanism for this. ^:once
seems fine, but I don't want to use if just to have it potentially break later, and the use of delay
seems like blatant abuse of the object just to make use of its inner (^:once fn* ...)
.
Oh well, at least it works now. Thanks guys.
Upvotes: 0
Reputation: 70241
The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.
The async one closes over results
in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?
I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).
Upvotes: 2
Reputation: 29984
I simplified you example, and get inconsistent results. I suspect that the manual Thread
object is somehow being regarded (sometimes) as a daemon thread, so the JVM sometimes exits before it has completed:
(def N 5e3)
(def total-count (* N N))
(def report-fact (int (/ total-count 20)))
(defn lazy-producer []
(for [y (range N)
x (range N)]
[x y (+ x y)]))
(defn sync-consumer [results]
(println "sync-consumer: start")
(time
(doseq [[i result] (map vector (range) results)]
(when (zero? (rem i report-fact))
(println (str (Math/round (/ (* 100 i) total-count)) " %")))))
(println "sync-consumer: stop"))
(defn async-consumer [results]
; (spyx (count results))
(spyx (nth results 99))
(let [thread (Thread. (fn []
(println "thread start")
(sync-consumer results)
(println "thread done")
(flush)))]
; (.setDaemon thread false)
(.start thread)
(println "daemon? " (.isDaemon thread))
thread))
(dotest
(println "test - start")
(let [thread (async-consumer
(lazy-producer))]
(when true
(println "test - sleeping")
(Thread/sleep 5000))
; (.join thread)
)
(println "test - end"))
with results:
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
60 %
lein test 54.58s user 1.37s system 372% cpu 15.028 total
If we uncomment the (.join thread)
line, we get a complete run:
~/expr/demo > lein test
----------------------------------
Clojure 1.9.0 Java 10.0.1
----------------------------------
lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon? false
test - sleeping
thread start
sync-consumer: start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
90 %
95 %
"Elapsed time: 9388.313828 msecs"
sync-consumer: stop
thread done
test - end
Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
lein test 72.52s user 1.69s system 374% cpu 19.823 total
It seems to exit early as if Clojure killed off the manual Thread
object.
Perhaps you have found an (intermittent) bug.
Upvotes: 0