Reputation: 1331
I just finished a program that collects data and computes several tables, which generally differ by filters or aggregation level (I wrote a SELECT + GROUP BY-like Clojure function). I finally use these tables to compute a summary table. This summary table is used later and is only 30,000 lines long.
To give you an idea, this is the call I have to launch:
(def summary-pages
  (merge-summary-pages
    (merge-pages-stock (compute-stock) (compute-pages) (compute-xx-prices) (compute-g) (compute-brands) (compute-prices) (compute-sales) (compute-sales-month))
    (merge-full-pages-per-page
      (merge-full-pages
        (merge-pages-stock (compute-stock) (compute-pages) (compute-xx-prices) (compute-g) (compute-brands) (compute-prices) (compute-sales) (compute-sales-month))
        (merge-pages-stock-excluded (compute-pages) (compute-stock) (compute-g) (compute-brands) (compute-prices) (compute-sales) (compute-sales-month))))
    (merge-pages-stock-per-page
      (merge-pages-stock (compute-stock) (compute-pages) (compute-xx-prices) (compute-g) (compute-brands) (compute-prices) (compute-sales) (compute-sales-month)))
    (merge-affectations-count (compute-affectations))))
As you can see, I compute the same data several times (and in fact compute-pages calls compute-affectations).
This works, but the problem is that compute-pages and especially compute-affectations are quite huge queries against Google BigQuery (15 million rows) and Microsoft SQL Server (45 million rows). Querying them 4-5 times takes a long time, and I'm also afraid of overloading the databases.
Another problem is that I must fully realize compute-affectations, because it comes from SQL Server and my left join uses group-by.
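(As an aside on why group-by forces full realization: group-by builds a map of all rows, so the entire input must be held at once. When only aggregates per key are needed, a single reduce keeps just the accumulator and lets consumed rows be collected. A minimal sketch with made-up row shapes, not the actual code:)

```clojure
;; Count rows per key in one pass. Only the accumulator map is
;; retained; each input row becomes garbage once it is folded in.
(defn count-by [key-fn rows]
  (reduce (fn [acc row]
            (update acc (key-fn row) (fnil inc 0)))
          {}
          rows))

(count-by :brand [{:brand "a"} {:brand "b"} {:brand "a"}])
;; => {"a" 2, "b" 1}
```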
I tried to split the job with def, but then I get a GC overhead limit exceeded error.
Because I can discard affectations after some computation, I tried

(def affectations nil)

...but it does not change anything; I do not see any memory freed in the Windows task manager.
Is there a way to free that memory?
In Python my program works without any problem (memory usage peaks at about 80%), but here, with a 13 GB heap, I always run into problems when I use def for any sub-data. I have 16 GB of RAM, so I cannot increase the heap further, and I find it strange that such "small" data takes so much memory: I dumped test data to CSV and the base data is only 3.3 GB.
EDIT
Working code (part of it):
(let [_ (init-config! path)
      _ (write-affectations)
      _ (write-pages)
      _ (spit "prepared-pages.edn" (prn-str (prepare-pages (read-pages))) :append true)
      _ (write-stock)
      _ (write-prices)
      _ (write-xx-prices)
      _ (write-brands)
      _ (write-g)
      _ (write-vehicles)
      _ (write-sales)
      _ (write-sales-month)
      _ (System/gc)
      stock (read-stock)
      affectations (read-affectations)
      pages (read-pages)
      prepared-pages (prepare-pages pages)
      xx-prices (read-xx-prices)
      g (read-g)
      brands (read-brands)
      prices (read-prices)
      sales (read-sales)
      sales-month (read-sales-month)
      pages-stock (merge-pages-stock stock prepared-pages xx-prices g brands prices sales sales-month)
      pages-stock-excluded (merge-pages-stock-excluded prepared-pages stock g brands prices sales sales-month)
      full-pages-per-page (-> (merge-full-pages pages-stock pages-stock-excluded)
                              (merge-full-pages-per-page))
      pages-stock-per-page (merge-pages-stock-per-page pages-stock)
      affectations-count (merge-affectations-count affectations)
      summary-pages (doall (merge-summary-pages pages-stock full-pages-per-page pages-stock-per-page affectations-count))
      _ (System/gc)
      _ (io/delete-file "affectations.edn")
      _ (io/delete-file "pages.edn")
      _ (io/delete-file "prepared-pages.edn")
      _ (io/delete-file "stock.edn")
      _ (io/delete-file "prices.edn")
      _ (io/delete-file "xx-prices.edn")
      _ (io/delete-file "brands.edn")
      _ (io/delete-file "g.edn")
      _ (io/delete-file "vehicles.edn")
      _ (io/delete-file "sales.edn")
      _ (io/delete-file "sales-month.edn")
I write the content of the queries to disk (.edn files), then read them back lazily and pass them to the functions.
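(The write-then-read-lazily pattern described here can be sketched as follows; the function names and record shapes are illustrative, not the actual code. The key points are one EDN record per line, so reads can stream via line-seq, and consuming the whole sequence inside with-open so the reader is still open while it is walked.)

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Write one record per line (append mode, like the :append spit above)
;; so the file can be re-read one element at a time.
(defn write-records! [path records]
  (with-open [w (io/writer path :append true)]
    (doseq [r records]
      (.write w (prn-str r)))))

;; Fold over the records lazily; the sequence must be fully consumed
;; before the reader closes, hence the reduce inside with-open.
(defn reduce-records [path f init]
  (with-open [r (io/reader path)]
    (reduce f init (map edn/read-string (line-seq r)))))
```

Note that returning the lazy sequence out of with-open would fail, since the reader would close before the sequence is realized; reducing inside the scope avoids that.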
Thanks !
Upvotes: 1
Views: 536
Reputation: 91554
Without knowing exactly what these functions do, it's hard to check for every possible problem, but one thing stands out: you compute the same expensive values over and over, and that is clearly part of (or perhaps all of) your problem. I de-duplicated your example so each value is only computed once.
(def summary-pages
  (let [stock (compute-stock)
        pages (compute-pages)
        prices (compute-prices)
        brands (compute-brands)
        sales-month (compute-sales-month)
        g (compute-g)
        sales (compute-sales)
        xx-prices (compute-xx-prices)
        pages-stock (merge-pages-stock stock pages xx-prices g brands prices sales sales-month)
        pages-stock-excluded (merge-pages-stock-excluded pages stock g brands prices sales sales-month)
        full-pages-per-page (merge-full-pages-per-page (merge-full-pages pages-stock pages-stock-excluded))
        pages-stock-per-page (merge-pages-stock-per-page pages-stock)
        affectations-count (merge-affectations-count (compute-affectations))]
    (merge-summary-pages pages-stock
                         full-pages-per-page
                         pages-stock-per-page
                         affectations-count)))
Your next step is to comment out all but the first one, verify that it runs correctly and in a reasonable amount of time, then uncomment the next and repeat.
Here is an example of a sequence that "should" cause an out-of-memory exception by holding the head of a sequence of a billion Integers:
user> (time
       (let [huge-sequence (range 1e9)]
         (last huge-sequence)))
"Elapsed time: 49406.048091 msecs"
999999999
But it does not. Why? It must have somehow figured out that it does not actually need to store the head of huge-sequence, because nobody uses it after the call to last. So let's test this hypothesis by changing the example so it must hold the head, and see if it still works:
user> (time
       (let [huge-sequence (range 1e9)]
         (last huge-sequence)
         (first huge-sequence)))
This time my CPU fan kicks in and all 8 processors spin up to 100% immediately. After five minutes I'm starting to get impatient, so I'll head out to lunch and see if it finishes while I'm eating. So far our hypothesis is looking accurate.
So when you define a bunch of sequences in a big let block, then make a single function call that uses them and never touch them after that call, the heads of the sequences can usually be cleared at function-call time and it will "magically" just work. This is why I suggested that you test these one at a time, from top to bottom.
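(This is also why your def approach made things worse: a var is a permanent GC root, so unlike a let-local it can never be cleared. A small sketch of the contrast, using a deliberately modest range so the first form is safe to evaluate:)

```clojure
;; Inside a let, the compiler clears the local after its last use,
;; so elements can be collected as `last` walks the sequence;
;; this runs in roughly constant memory:
(let [xs (range 1000000)]
  (last xs))
;; => 999999

;; A def'd var keeps the head reachable forever, so walking the
;; sequence retains every realized element; with a large enough
;; range this is exactly what exhausts the heap:
(def xs (range 1000000))
;; (last xs)  ; retains all realized elements until xs is redefined and GC runs
```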
There is almost never a reason in Clojure to compute the same data twice; we take referential transparency seriously around here.
user> (time
       (let [huge-sequence (range 1e9)]
         (last huge-sequence)
         (first huge-sequence)))
OutOfMemoryError Java heap space  [trace missing]
Upvotes: 2