Micah

Reputation: 10395

How To Read Directory Of Files, Line by Line, Lazily in Clojure

(->> "/Users/micahsmith/printio/gooten-import-ai/jupyter/data"
     File.
     file-seq
     (filter #(-> ^File % .getAbsolutePath (str-contains? ".json")))
     (mapcat (fn [^File file]
            (with-open [ rdr (io/reader file)]
              (line-seq rdr)))))

I'm trying to read a directory of JSON files line by line, lazily, so that I can process the data lazily as well.

I keep getting java.io.IOException: Stream closed. How can I consume the lines without the reader being closed too early?

Upvotes: 1

Views: 488

Answers (2)

amalloy

Reputation: 91857

The with-open macro is designed to discourage you from doing this, because file handles and other operating system resources are the sort of thing you should handle carefully instead of lazily. You are intended to do all processing of the file contents within the dynamic scope of your with-open. So, instead of returning a lazy sequence, you should accept a function as an argument, and call that function on the lazy sequence while still within the scope of with-open. That function should of course not return another lazy sequence, but instead process its entire input before returning.

So the typical use for such a thing is like this:

(defn process-file [filename process]
  (with-open [f (io/reader filename)]
    (process (line-seq f))))
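As a minimal usage sketch, assuming the process-file definition above: the temp file and its contents are made up purely for illustration, and count stands in for whatever eager function you actually need.

```clojure
(require '[clojure.java.io :as io])

;; Same definition as above, repeated so the sketch is self-contained.
(defn process-file [filename process]
  (with-open [f (io/reader filename)]
    (process (line-seq f))))

;; Hypothetical sample data: a temp file with three lines.
(def sample-file
  (doto (java.io.File/createTempFile "sample" ".json")
    (.deleteOnExit)
    (spit "{\"a\": 1}\n{\"a\": 2}\n{\"a\": 3}\n")))

;; count consumes the whole sequence before with-open closes the reader.
(def line-count (process-file sample-file count))

;; To keep the lines themselves, force them with doall while the reader is open.
(def all-lines (process-file sample-file doall))
```

Passing doall gives up laziness for that file, so it only makes sense when a single file fits comfortably in memory.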

It's a little more complicated when you have a list of files, each needing its own with-open: you can't just call process once. One thing you could do is return a list of the results of calling process on each file:

(defn process-files [filenames process]
  (for [filename filenames]
    (with-open [f (io/reader filename)]
      (process (line-seq f)))))

Then if you need to do some global operation on that, you can reduce over the result of process-files.
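A rough sketch of that last step, assuming the process-files definition above. The temp-file-with helper and the sample contents are hypothetical, and counting lines is just a stand-in for a real per-file computation.

```clojure
(require '[clojure.java.io :as io])

;; Same definition as above, repeated so the sketch is self-contained.
(defn process-files [filenames process]
  (for [filename filenames]
    (with-open [f (io/reader filename)]
      (process (line-seq f)))))

;; Hypothetical helper: write the given text to a temp file and return it.
(defn temp-file-with [text]
  (doto (java.io.File/createTempFile "lines" ".json")
    (.deleteOnExit)
    (spit text)))

(def files [(temp-file-with "a\nb\n")
            (temp-file-with "c\nd\ne\n")])

;; Each file is boiled down to one number while its reader is still open;
;; reduce then combines the per-file results.
(def total-lines (reduce + (process-files files count)))
```

Because for is lazy, each file is only opened when its result is realized (possibly a chunk of files at a time), and every reader is closed again before the next result is produced.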

Upvotes: 2

Carcigenicate

Reputation: 45736

The problem is that with-open calls .close as soon as execution leaves the scope it encloses, but line-seq is lazy, so the lines haven't necessarily all been read by that point.
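To make that concrete, here is a small sketch that reproduces the failure on a single file. The temp file and the lines-too-late name are made up for the example.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical sample data: a temp file with three lines.
(def sample-file
  (doto (java.io.File/createTempFile "sample" ".json")
    (.deleteOnExit)
    (spit "line 1\nline 2\nline 3\n")))

;; Returns a mostly unrealized lazy sequence; the reader is closed on the way out.
(defn lines-too-late [file]
  (with-open [rdr (io/reader file)]
    (line-seq rdr)))

;; Realizing the sequence afterwards calls .readLine on the closed reader.
(def outcome
  (try
    (doall (lines-too-late sample-file))
    :ok
    (catch java.io.IOException _
      :stream-closed)))
```

Only the first line is read eagerly by line-seq; every later line is requested after .close has already run, which is where the exception comes from.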

My solution is probably an abusive abomination that should never have seen the light of day, but here's the idea: create a "lazy-seq" that just calls .close, and concatenate it to the end of the line-seq list:

(require '[clojure.java.io :as io]
         '[clojure.string])
(import 'java.io.File)

(defn lazy-lines [^File file]
  (let [rdr (io/reader file)]
    (lazy-cat (line-seq rdr)
              (do (.close rdr)
                  nil)))) ; Explicit nil to indicate termination

(defn get-lines [^String path]
  (->> path
       (File.)
       (file-seq)
       (filter #(-> ^File % (.getAbsolutePath) (clojure.string/includes? ".json")))
       (mapcat lazy-lines)))

From my quick testing with files on my Desktop, it appears to work. If you add a println into the terminating lazy-seq, it prints as expected, so the file is being closed.
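Here is roughly what that check could look like. The lazy-lines-verbose name, the message text, and the temp file are made up for the sketch; the structure is the same as lazy-lines.

```clojure
(require '[clojure.java.io :as io])

;; Variant of lazy-lines with a println added to the closing step.
(defn lazy-lines-verbose [file]
  (let [rdr (io/reader file)]
    (lazy-cat (line-seq rdr)
              (do (println "closing reader")
                  (.close rdr)
                  nil))))

;; Hypothetical sample data: a temp file with three lines.
(def sample-file
  (doto (java.io.File/createTempFile "sample" ".json")
    (.deleteOnExit)
    (spit "line 1\nline 2\nline 3\n")))

;; Consuming everything reaches the final lazy-seq, so the message is printed.
(def printed-when-exhausted
  (with-out-str (doall (lazy-lines-verbose sample-file))))

;; Stopping early never reaches it: nothing is printed and the reader stays open.
(def printed-when-partial
  (with-out-str (doall (take 1 (lazy-lines-verbose sample-file)))))
```

The second call deliberately leaks its reader to show that the close only happens once the whole sequence has been walked.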

I'm hesitant to suggest this solution though, as it relies on carrying out side effects inside a lazy sequence, which I've been conditioned to feel is wrong. The major downside of this method is that the file won't be closed unless the entire sequence is evaluated, and the file will stay open the whole time until the end is reached. Given the constraints though, I don't see how either of these problems could be avoided.


I realized I was using lazy-cat slightly wrong: I had an extra, unnecessary lazy-seq wrapper. It's fixed now. You could also just use plain concat with an explicit lazy-seq, something like

(concat (line-seq rdr)
        (lazy-seq (.close rdr)
                  nil))

instead of lazy-cat.

Upvotes: 1
