Reputation: 73
I'm reading about how lazy sequences can cause OutOfMemoryErrors when using, say, loop/recur on large sequences. I'm trying to load a 3MB file into memory to process it, and I think this is happening to me. But I don't know if there's an idiomatic way to fix it. I tried putting in doalls, but then my program didn't seem to terminate. Small inputs work:
Small input (contents of file): AAABBBCCC
Correct output: ((65 65) (65 66) (66 66) (67 67) (67 67))
Code:
(def file-path "/Users/me/Desktop/temp/bob.txt")
;(def file-path "/Users/me/Downloads/3MB_song.m4a")
(def group-by-twos
  (fn [a-list]
    (let [first-two (fn [a-list] (list (take 2 a-list)))
          the-rest-after-two (fn [a-list] (rest (rest a-list)))
          only-two-left? (fn [a-list] (if (= (count a-list) 2) true false))]
      (loop [result '() rest-of-list a-list]
        (if (nil? rest-of-list)
          result
          (if (only-two-left? rest-of-list)
            (concat result (list rest-of-list))
            (recur (concat result (first-two rest-of-list))
                   (the-rest-after-two rest-of-list))))))))
(def get-the-file
  (fn [file-name-and-path]
    (let [the-file-pointer
          (new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
          intermediate-array (byte-array (.length the-file-pointer))] ;reserve space for final length
      (.readFully the-file-pointer intermediate-array)
      (group-by-twos (seq intermediate-array)))))
(get-the-file file-path)
As I said above, when I put in doalls in a bunch of places, it didn't seem to finish. How can I get this to run for large files, and is there a way to get rid of the cognitive burden of doing whatever I need to do? Some rule?
Upvotes: 1
Views: 525
Reputation: 39
Beware of Clojure data structures when dealing with large amounts of data. (A typical Clojure app uses two to three times as much memory as the same Java application; sequences are memory-expensive.) If you can read the whole data into an array, do that. Then process it, making sure you don't keep a reference to the head of any sequence, so that garbage collection can happen during processing.
Also, strings are much bigger than char primitives: a single-char string is 26 bytes, while a char is 2 bytes. Even if you don't like using arrays, an ArrayList is several times smaller than a sequence or a vector.
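A minimal sketch of the array approach described above, not a benchmarked solution. The names read-bytes and pair-bytes are made up for illustration, and the sketch assumes an odd trailing byte should end up in a group of its own. It walks the array by index, so no sequence head is ever held:

```clojure
;; Hypothetical helper: read a whole file into a byte array.
(defn read-bytes
  [^String path]
  (java.nio.file.Files/readAllBytes (.toPath (java.io.File. path))))

;; Hypothetical helper: group the bytes of an array into vectors of two.
;; A transient vector accumulates the result; the array is indexed directly,
;; so no intermediate lazy sequence is built.
(defn pair-bytes
  [^bytes arr]
  (let [n (alength arr)]
    (loop [i 0
           acc (transient [])]
      (cond
        ;; all bytes consumed
        (>= i n) (persistent! acc)
        ;; one byte left over (odd length): assumed to go in its own group
        (= (inc i) n) (persistent! (conj! acc [(aget arr i)]))
        ;; normal case: take two bytes and move on
        :else (recur (+ i 2)
                     (conj! acc [(aget arr i) (aget arr (inc i))]))))))
```

Usage would look like (pair-bytes (read-bytes file-path)). Note that Java bytes are signed, so values above 127 in a binary file show up as negative numbers.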
Upvotes: 1
Reputation: 1085
I guess an idiomatic solution would be:
(partition 2 (map int (slurp "/Users/me/Desktop/temp/bob.txt")))
This is not fully lazy, as the full file is loaded into memory, but it should work without problems for files that are not too big. However, partition and map are lazy, so if you replace slurp with a buffered reader you will get a fully lazy version.
Note: this will swallow the last char if the size of the file is odd. It is not clear what you expect if the size is odd. If you want to have the last value in its own list, you can use (partition 2 2 [] ... )
user=> (partition 2 (map int "ABCDE"))
((65 66) (67 68))
user=> (partition 2 2 [] (map int "ABCDE"))
((65 66) (67 68) (69))
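The buffered-reader variant mentioned above could be sketched roughly like this. The names char-codes and process-pairs are hypothetical, and the sketch assumes the sequence is fully consumed while the reader is still open. It uses partition-all, which keeps a trailing odd element in the same way as (partition 2 2 [] ...):

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: lazy seq of character codes read one at a time.
;; .read returns -1 at end of input, which ends the sequence.
(defn char-codes
  [^java.io.Reader rdr]
  (take-while #(not= % -1) (repeatedly #(.read rdr))))

;; Hypothetical helper: open `source`, hand the lazy pairs to `f`,
;; and close the reader afterwards. `f` must finish consuming the
;; pairs before it returns (e.g. via reduce, mapv, count), because the
;; reader is closed as soon as with-open exits.
(defn process-pairs
  [source f]
  (with-open [rdr (io/reader source)]
    (f (partition-all 2 (char-codes rdr)))))
```

For example, (process-pairs "/Users/me/Desktop/temp/bob.txt" count) would count the pairs without holding them all in memory. A reader decodes text, so for a binary file such as an .m4a an io/input-stream is probably the better fit.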
Upvotes: 2
Reputation: 33637
You are reading the whole file into memory and then creating a seq over the byte array. That gives you none of the benefit of a lazy sequence, because all the data is already loaded; a lazy sequence is meant to produce data only when it is required.
What you can do is create the seq over the file content using something like:
(def get-the-file
  (fn [file-name-and-path]
    (let [the-file-pointer
          (new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
          file-len (.length the-file-pointer)] ;get file len
      (partition 2 (map (fn [_] (.readByte the-file-pointer)) (range file-len))))))
NOTE: I haven't really tried it, but I hope it at least gives you the idea of the lazy file-reading part.
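One thing the snippet above leaves open is when the file gets closed. A possible variation, offered as a sketch only: wrap the read in with-open and realize the pairs before returning, so the file handle is not left dangling. The name byte-pairs-of is hypothetical, and realizing everything gives up laziness in exchange for a deterministic close:

```clojure
;; Hypothetical variation: read byte by byte, pair up, and realize the
;; result into a vector before with-open closes the file.
(defn byte-pairs-of
  [^String path]
  (with-open [raf (java.io.RandomAccessFile. (java.io.File. path) "r")]
    (let [n (.length raf)]
      (->> (range n)
           (map (fn [_] (.readByte raf)))  ; each call advances the file pointer
           (partition 2)                   ; drops a trailing odd byte
           (mapv vec)))))                  ; mapv is eager, so reading finishes here
```

Reading one byte per call on an unbuffered RandomAccessFile is likely to be slow on a multi-megabyte file, so a buffered stream may be preferable in practice.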
Upvotes: 2