Carcigenicate
Carcigenicate

Reputation: 45806

Why does looping over a large amount of data in another thread cause an overactive GC, and prevent some data from being freed?

I'm writing code that takes some lazy results produced by pmap, and draws them onto a BufferedImage. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about a 1/3 of the way through.

I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.

This is the best MCVE that I've come up with:

(ns mandelbrot-redo.irrelevant.write-image-mcve
  (:import [java.awt.image BufferedImage]
           (java.util.concurrent Executors Executor)))

(defn lazy-producer [width height]
  (for [y (range height)
        x (range width)]
    [x y (+ x y)]))

; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
  (time
    (doseq [[i result] (map vector (range) results)]
      (when (zero? (rem i 1e6))
        (println (str (double (/ i (* width height) 0.01)) "%")))

      ((fn boop [x] x) result)))) ; Data gets consumed here

; This gets to ~30%, then begins being interupted by 1-4 second lags
(defn async-consumer [results width height]
  (doto
    (Thread. ^Runnable
             (fn []
               (sync-consumer results width height)
               (println "Done...")))
    (.start)))

(defn -main []
  (let [width 5000
        height (int (* width 2/3))]
    (-> (lazy-producer width height)
        (async-consumer width height))))

When -main is run with sync-consumer, it finishes after a few seconds. With async-consumer however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.

If I use an explicit Thread., or use a local thread pool in async-consumer, it hangs and crashes. If I use future however, it finishes fine, just like sync-consumer.

The only hint I've gotten is that when I run this in VisualVM, I see that I have runaway allocation of Longs when using the async version:

VisualVM Snapshot

The sync version shows a peak amount of Longs to be about 45mb at once in comparison.

The CPU usage is quite different too:

enter image description here

There's massive GC spikes, but it doesn't seem like the Longs are being disposed of.

I could use future for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.

Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?

Can anyone explain this behavior?

Upvotes: 4

Views: 228

Answers (3)

Carcigenicate
Carcigenicate

Reputation: 45806

Thanks to @amalloy and @Alex, I got it working.

I implemented the suggestions by @amalloy in the comments, and both variants work here and in my real case:

; Brittle since "once" is apparently more of an implementation detail of the language
(defn async-consumer [results width height]
  (doto
    (Thread. ^Runnable
             (^:once fn* []
               (sync-consumer results width height)
               (println "Done...")))
    (.start)))

; Arguably less brittle under the assumption that if they replace "once" with another mechanism,
;  they'll update "delay".
(defn async-consumer [results width height]
  (let [d (delay (sync-consumer results width height))]
    (doto
      (Thread. ^Runnable
               (fn []
                 @d
                 (println "Done...")))
      (.start))))

I also tried updating to 1.9.0. I thought that might fix it since @Alex says he's on 1.9.0 and can't reproduce this, and there's also this bug fix that seems related. Unfortunately, I didn't notice any difference.

It would be nice if there was an actual, solid mechanism for this. ^:once seems fine, but I don't want to use if just to have it potentially break later, and the use of delay seems like blatant abuse of the object just to make use of its inner (^:once fn* ...).

Oh well, at least it works now. Thanks guys.

Upvotes: 0

Alex Miller
Alex Miller

Reputation: 70241

The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.

The async one closes over results in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?

I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).

Upvotes: 2

Alan Thompson
Alan Thompson

Reputation: 29984

I simplified you example, and get inconsistent results. I suspect that the manual Thread object is somehow being regarded (sometimes) as a daemon thread, so the JVM sometimes exits before it has completed:

(def N 5e3)
(def total-count (* N N))
(def report-fact (int (/ total-count 20)))

(defn lazy-producer []
  (for [y (range N)
        x (range N)]
    [x y (+ x y)]))

(defn sync-consumer [results]
  (println "sync-consumer:  start")
  (time
    (doseq [[i result] (map vector (range) results)]
      (when (zero? (rem i report-fact))
        (println (str (Math/round (/ (* 100 i) total-count)) " %")))))
  (println "sync-consumer:  stop"))

 (defn async-consumer [results]
  ; (spyx (count results))
  (spyx (nth results 99))
  (let [thread (Thread. (fn []
                          (println "thread  start")
                          (sync-consumer results)
                          (println "thread  done")
                          (flush)))]
    ; (.setDaemon thread false)
    (.start thread)
    (println "daemon?   " (.isDaemon thread))
    thread))

(dotest
  (println "test - start")
  (let [thread (async-consumer
                 (lazy-producer))]
    (when true
      (println "test - sleeping")
      (Thread/sleep 5000))
 ;  (.join thread)
  )
  (println "test - end"))

with results:

----------------------------------
   Clojure 1.9.0    Java 10.0.1
----------------------------------

lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon?    false
test - sleeping
thread  start
sync-consumer:  start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
test - end

Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
60 %
lein test  54.58s user 1.37s system 372% cpu 15.028 total

If we uncomment the (.join thread) line, we get a complete run:

~/expr/demo > lein test

----------------------------------
   Clojure 1.9.0    Java 10.0.1
----------------------------------

lein test tst.demo.core
test - start
(nth results 99) => [99 0 99]
daemon?    false
test - sleeping
thread  start
sync-consumer:  start
0 %
5 %
10 %
15 %
20 %
25 %
30 %
35 %
40 %
45 %
50 %
55 %
60 %
65 %
70 %
75 %
80 %
85 %
90 %
95 %
"Elapsed time: 9388.313828 msecs"
sync-consumer:  stop
thread  done
test - end

Ran 2 tests containing 0 assertions.
0 failures, 0 errors.
lein test  72.52s user 1.69s system 374% cpu 19.823 total

It seems to exit early as if Clojure killed off the manual Thread object.

Perhaps you have found an (intermittent) bug.

Upvotes: 0

Related Questions