Reputation: 3247
I am trying to parse a 50MB CSV file: ~2500 rows, ~5500 columns, one column of strings (dates as yyyy-mm-dd) and the rest floats with lots of empty cells. I need access to all the data, so I would like to realize the full file, which should be possible at that size.
I've tried a few options from:
(with-open [rdr (io/reader path)] (doall (csv/read-csv rdr)))
to slightly more manual approaches using line-seq and parsing the strings into numbers myself.
My JVM memory usage after a single slurp goes up by 100MB, 2x the file size. Parsing the data adds 1-2GB depending on how it's done. If I open and parse the file several times into the same var, memory usage keeps going up and I end up with a memory error and the program fails. (I understand the task manager isn't the best way to look for memory leaks, but the fact is the program fails, so there is a leak somewhere.)
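For reference, here is a rough way to check heap use from inside the REPL rather than the task manager (used-heap-mb is just a throwaway helper I made up for this):

;; Force a GC, then report how much heap is actually in use, in MB.
;; Only approximate, but closer to the truth than the task manager,
;; which also counts memory the JVM has reserved but is not using.
(defn used-heap-mb []
  (System/gc)
  (let [rt (Runtime/getRuntime)]
    (quot (- (.totalMemory rt) (.freeMemory rt)) (* 1024 1024))))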
What is the right way of opening the file? My final use case is I'll be getting a new file every day and I want a server application to open the file and crunch data every day without running out of memory and needing to restart the server.
Edit: for comparison, reading this file with Python pandas consumes about 100MB of memory, and subsequent re-reads of the file don't keep increasing memory usage.
Edit2: here's a minimal example using local atoms to try and see what's going on:
(defn parse-number [s] (if (= s "") nil (read-string s)))
(defn parse-line [line]
  (let [result (atom [])]
    (doseq [x (clojure.string/split line #",")]
      (swap! result conj (parse-number x)))
    @result))
(defn line-by-line-parser [file]
  (let [result (atom [])]
    (with-open [rdr (clojure.java.io/reader file)]
      (doseq [line (line-seq rdr)]
        (swap! result conj (parse-line line)))
      @result)))
;in the repl:
(def x (line-by-line-parser "C:\\temp\\history.csv")) ; memory goes up 1GB
(def x (line-by-line-parser "C:\\temp\\history.csv")) ; memory goes up an extra 1GB
; etc
Thanks a lot!
Upvotes: 5
Views: 347
Reputation: 1976
As long as you don't keep your parsed data under any GC roots (like a def or a memoized function), the code you've shown above should not leak. You can easily prove that by looping your code 100 times and seeing whether you get an OOM (I don't expect any). Having said that, there are things you can do to relieve memory pressure, as suggested by others.
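A concrete version of that loop test (a quick sketch, reusing line-by-line-parser from the question):

;; If this finishes without an OutOfMemoryError, nothing is leaking:
;; each parsed result becomes garbage as soon as the next iteration starts.
(dotimes [i 100]
  (println i (count (line-by-line-parser "C:\\temp\\history.csv"))))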
If you want to know exactly where the memory goes, pick up a profiler and dig into it.
My hunch is that your case is just GC pressure (not a leak), specifically the use of read-string, much more than conj/atom. Try replacing read-string with something lower level (e.g. Double/parseDouble) and you should see a big difference. conj, on the other hand, is super efficient from a persistent data-structure perspective (which Python doesn't use), but of course it will never beat a primitive array (which Python uses). atom is usually used for concurrency; in your case it can be replaced with transient (and persistent!), but I don't expect that to make a big difference.
Update: added an allocation flame graph. read-string accounts for 70% of the memory allocation while running.
Upvotes: 2
Reputation: 2982
As I mentioned in my comment, there are two things that bother me about how you use atoms: the atom itself is unnecessary for building up a purely local result, and every swap! call generates a bunch of garbage objects, which could explain the memory usage. Both problems can be solved using into or reduce. As a benefit, the resulting code will be shorter.
I am more familiar with reduce, so here is an example using it:
(defn parse-number [s] (if (= s "") nil (read-string s)))
(defn parse-line [line]
  (reduce #(conj %1 (parse-number %2))
          []
          (clojure.string/split line #",")))

(defn line-by-line-parser [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (reduce #(conj %1 (parse-line %2))
            []
            (line-seq rdr))))
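For comparison, a sketch of the into variant (into with a transducer builds the vector on a transient under the hood, so it generates less intermediate garbage than repeated conj on a persistent vector):

;; Same shape as above, expressed with into and a map transducer.
(defn parse-line [line]
  (into [] (map parse-number) (clojure.string/split line #",")))

(defn line-by-line-parser [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (into [] (map parse-line) (line-seq rdr))))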
As I do not have your test data, I can only guess that this could solve your problem, so I would be happy if you tested it and reported whether there are any improvements.
Upvotes: 0