Reputation: 3247
I am trying to parse a 50MB CSV file: ~2500 rows, ~5500 columns, one column of strings (dates as yyyy-mm-dd) and the rest floats with lots of empty cells. I need access to all the data, so I would like to realize the full file, which should be possible at that size.
I've tried a few options from:
(with-open [rdr (io/reader path)] (doall (csv/read-csv rdr)))
to slightly more manual approaches using line-seq and parsing the strings into numbers myself.
My JVM memory usage after a single slurp goes up by 100MB, 2x the file size. Parsing the data adds 1-2GB depending on how it's done. If I open and parse the file several times into the same var, memory usage keeps going up and I end up with a memory error and the program fails. (I understand the task manager isn't the best way to look for memory leaks, but the fact is the program fails, so there is a leak somewhere.)
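For reference, here is a rough way to check heap use from inside the REPL rather than the task manager (used-heap-mb is just a throwaway helper I made up for this):

;; Force a GC, then report how much heap is actually in use, in MB.
;; Only approximate, but closer to the truth than the task manager,
;; which also counts memory the JVM has reserved but is not using.
(defn used-heap-mb []
  (System/gc)
  (let [rt (Runtime/getRuntime)]
    (quot (- (.totalMemory rt) (.freeMemory rt)) (* 1024 1024))))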
What is the right way of opening the file? My final use case is I'll be getting a new file every day and I want a server application to open the file and crunch data every day without running out of memory and needing to restart the server.
Edit: for comparison, reading this file with Python pandas consumes about 100MB of memory, and subsequent re-reads of the file don't keep increasing memory usage.
Edit2: here's a minimal example using local atoms to try and see what's going on:
(defn parse-number [s] (if (= s "") nil (read-string s)))
(defn parse-line [line]
  (let [result (atom [])]
    (doseq [x (clojure.string/split line #",")]
      (swap! result conj (parse-number x)))
    @result))
(defn line-by-line-parser [file]
  (let [result (atom [])]
    (with-open [rdr (clojure.java.io/reader file)]
      (doseq [line (line-seq rdr)]
        (swap! result conj (parse-line line)))
      @result)))
;in the repl:
(def x (line-by-line-parser "C:\\temp\\history.csv")) ; memory goes up 1GB
(def x (line-by-line-parser "C:\\temp\\history.csv")) ; memory goes up an extra 1GB
; etc
Thanks a lot!
Upvotes: 5
Views: 347
Reputation: 1976
As long as you don't keep your parsed data under any GC roots (like a def or a memoized function), the code you've shown above should not leak. You can easily prove that by looping your code 100 times and seeing whether you get an OOM (I don't expect any). Having said that, there are things you can do to relieve memory pressure, as suggested by others.
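A concrete version of that loop test (a quick sketch, reusing line-by-line-parser from the question):

;; If this finishes without an OutOfMemoryError, nothing is leaking:
;; each parsed result becomes garbage as soon as the next iteration starts.
(dotimes [i 100]
  (println i (count (line-by-line-parser "C:\\temp\\history.csv"))))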
If you want to know exactly where the memory goes, pick up a profiler and dig into it.
My hunch is that your case is just GC pressure (not a leak), specifically the use of read-string, much more than conj/atom. Try replacing read-string with something lower level (e.g. Double/parseDouble) and you should see a big difference. conj, on the other hand, is super efficient from a persistent data-structure perspective (which Python doesn't use), but of course it will never beat a primitive array (which Python uses). atom is usually used for concurrency; in your case it can be replaced with transient (and persistent!), but I don't expect that to make a big difference.
Update: added an allocation flame graph. read-string accounts for 70% of the memory allocation while running.
Upvotes: 2
Reputation: 2982
As I mentioned in my comment, there are two things that bother me about how you use atoms: the atom itself is unnecessary for building up a purely local result, and every swap! call generates a bunch of garbage objects, which could explain the memory usage. Both problems can be solved using into or reduce. As a benefit, the resulting code will be shorter.
I am more familiar with reduce, so here is an example using it:
(defn parse-number [s] (if (= s "") nil (read-string s)))
(defn parse-line [line]
  (reduce #(conj %1 (parse-number %2))
          []
          (clojure.string/split line #",")))

(defn line-by-line-parser [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (reduce #(conj %1 (parse-line %2))
            []
            (line-seq rdr))))
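For comparison, a sketch of the into variant (into with a transducer builds the vector on a transient under the hood, so it generates less intermediate garbage than repeated conj on a persistent vector):

;; Same shape as above, expressed with into and a map transducer.
(defn parse-line [line]
  (into [] (map parse-number) (clojure.string/split line #",")))

(defn line-by-line-parser [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (into [] (map parse-line) (line-seq rdr))))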
As I do not have your test data, I can only guess that this could solve your problem, so I would be happy if you tested it and reported whether there are any improvements.
Upvotes: 0