John Biesnecker

Reputation: 3812

More idiomatic line-by-line handling of a file in Clojure

I'm trying to read a file that may or may not have YAML frontmatter, line by line, using Clojure, and return a hashmap with two vectors: one containing the frontmatter lines and one containing everything else (i.e., the body).

An example input file would look like this:

---
key1: value1
key2: value2
---

Body text paragraph 1

Body text paragraph 2

Body text paragraph 3

I have functioning code that does this, but to my (admittedly inexperienced with Clojure) nose, it reeks of code smell.

(require '[clojure.string :as string])

(defn process-file [f]
  (with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
    (loop [lines (line-seq rdr) in-fm 0 frontmatter [] body []]
      (if-not (empty? lines)
        (let [line (string/trim (first lines))]
          (cond
            (zero? (count line))
              (recur (rest lines) in-fm frontmatter body)
            (and (< in-fm 2) (= line "---"))
              (recur (rest lines) (inc in-fm) frontmatter body)
            (= in-fm 1)
              (recur (rest lines) in-fm (conj frontmatter line) body)
            :else
              (recur (rest lines) in-fm frontmatter (conj body line))))
        (hash-map :frontmatter frontmatter :body body)))))

Can someone point me to a more elegant way to do this? I'm going to be doing a decent amount of line-by-line parsing in this project, and I'd like a more idiomatic way of going about it if possible.

Upvotes: 3

Views: 1095

Answers (2)

Francisco Meza

Reputation: 883

Actually, the idiomatic way to do this in Clojure would be to avoid returning 'a hashmap with two vectors' and instead treat the file as a (lazy) sequence of lines.

Then the function that processes the sequence of lines decides whether the file has YAML frontmatter or not.

Something like this:

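(use '[clojure.java.io :only (reader)])
(let [s (line-seq (reader "YOURFILENAMEHERE"))]
  ;; line-seq strips line terminators and take returns a seq, not a string,
  ;; so compare the first line itself against "---"
  (if (= "---" (first (line-seq (reader "YOURFILENAMEHERE"))))
    (process-seq-with-frontmatter s)
    (process-seq-without-frontmatter s)))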

By the way, this is a quick and dirty solution; two things to improve:

  1. Notice I'm creating two seqs for the same file. It would be better to create just one and inspect its first line without consuming it (a peek instead of a pop); calling first on the seq does exactly that.
  2. I think it would be cleaner to have a multimethod 'process-seq' (with a better name, of course) that dispatches on the content of the first line of the seq.
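
A rough sketch of what point 2 could look like. The name process-seq, the dispatch values :frontmatter and :plain, and the shape of the returned map are all placeholders I made up for illustration, and it assumes the separator line is exactly "---" with no surrounding whitespace:

```clojure
;; Hypothetical multimethod: dispatch on whether the first line is "---".
;; `first` only looks at the head of the seq; the seq itself is unchanged.
(defmulti process-seq
  (fn [lines]
    (if (= "---" (first lines))
      :frontmatter
      :plain)))

(defmethod process-seq :frontmatter [lines]
  ;; Skip the opening "---", then split at the closing one.
  (let [[front more] (split-with #(not= "---" %) (rest lines))]
    {:frontmatter (vec front)
     :body        (vec (rest more))}))   ; (rest more) drops the closing "---"

(defmethod process-seq :plain [lines]
  ;; No frontmatter: everything is body.
  {:frontmatter []
   :body        (vec lines)})
```

Since the dispatch function only calls first, a single seq can be handed to process-seq and there is no need to open the file twice.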

Upvotes: 0

Michał Marczyk

Reputation: 84369

Firstly, I'd put line-processing logic in its own function to be called from a function actually reading in the files. Better yet, you can make the function dealing with IO take a function to map over the lines as an argument, perhaps along these lines:

(require '[clojure.java.io :as io])

(defn process-file-with [f filename]
  (with-open [rdr (io/reader (io/file filename))]
    (f (line-seq rdr))))

Note that this arrangement makes it the duty of f to realize as much of the line seq as it needs before it returns (because afterwards with-open will close the underlying reader of the line seq).
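
To make that caveat concrete, here is a small sketch. The scratch file and the var names (tmp, eager-lines, n, lazy-lines, failed?) are made up for the demonstration; process-file-with is repeated so the snippet stands on its own:

```clojure
(require '[clojure.java.io :as io])

(defn process-file-with [f filename]
  (with-open [rdr (io/reader (io/file filename))]
    (f (line-seq rdr))))

;; Hypothetical scratch file, created only for this demonstration.
(def tmp (doto (java.io.File/createTempFile "frontmatter-demo" ".txt")
           (.deleteOnExit)))
(spit tmp "---\nkey1: value1\n---\nBody\n")

;; Fine: doall realizes every line while the reader is still open.
(def eager-lines (process-file-with doall tmp))

;; Also fine: count has to walk the whole seq before it can return.
(def n (process-file-with count tmp))

;; Problematic: map is lazy, so an unrealized seq escapes with-open.
(def lazy-lines (process-file-with #(map identity %) tmp))

;; Realizing it afterwards tries to read from the closed reader,
;; which I would expect to fail with an IOException ("Stream closed").
(def failed?
  (try (doall lazy-lines) false
       (catch Exception _ true)))
```

So any f passed in should end with something eager (doall, vec, into, reduce and so on) if it returns lines.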

Given this division of responsibilities, the line processing function might look like this, assuming the first --- must be the first non-blank line and all blank lines are to be skipped (as they would be when using the code from the question text):

(require '[clojure.string :as string])

(defn process-lines [lines]
  (let [ls (->> lines
                (map string/trim)
                (remove string/blank?))]
    (if (= (first ls) "---")
      (let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
        {:front (vec front) :body (vec (next sep-and-body))})
      {:body (vec ls)})))

Note the calls to vec which cause all the lines to be read in and returned in a vector or pair of vectors (so that we can use process-lines with process-file-with without the reader being closed too soon).
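
Putting the two pieces together might look like the following sketch. The temp file and the names post and parsed are assumptions for the sake of a runnable example; both functions are repeated from above so the snippet is self-contained:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn process-file-with [f filename]
  (with-open [rdr (io/reader (io/file filename))]
    (f (line-seq rdr))))

(defn process-lines [lines]
  (let [ls (->> lines
                (map string/trim)
                (remove string/blank?))]
    (if (= (first ls) "---")
      (let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
        {:front (vec front) :body (vec (next sep-and-body))})
      {:body (vec ls)})))

;; Hypothetical input file, written to a temp location for the example.
(def post (doto (java.io.File/createTempFile "post" ".md")
            (.deleteOnExit)))
(spit post (str "---\nkey1: value1\nkey2: value2\n---\n\n"
                "Body text paragraph 1\n\nBody text paragraph 2\n"))

;; process-lines returns fully realized vectors, so the result is
;; safe to use after with-open has closed the reader.
(def parsed (process-file-with process-lines post))
```

Here (:front parsed) and (:body parsed) are ordinary vectors that no longer depend on the reader.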

Because reading lines from an actual file on disk is now decoupled from processing a seq of lines, we can easily test the latter part of the process at the REPL (and of course this can be made into a unit test):

;; could input this as a single string and split, of course
(def test-lines
  ["---"
   "key1: value1"
   "key2: value2"
   "---"
   ""
   "Body text paragraph 1"
   ""
   "Body text paragraph 2"
   ""
   "Body text paragraph 3"])

Calling our function now:

user> (process-lines test-lines)
{:front ["key1: value1" "key2: value2"],
 :body ["Body text paragraph 1"
        "Body text paragraph 2"
        "Body text paragraph 3"]}

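As for turning this into a unit test, a minimal sketch with clojure.test could look like this. The test name process-lines-test and the sample inputs are my own choices, and process-lines is repeated so the snippet can be loaded by itself:

```clojure
(require '[clojure.string :as string]
         '[clojure.test :refer [deftest is]])

(defn process-lines [lines]
  (let [ls (->> lines
                (map string/trim)
                (remove string/blank?))]
    (if (= (first ls) "---")
      (let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
        {:front (vec front) :body (vec (next sep-and-body))})
      {:body (vec ls)})))

(deftest process-lines-test
  ;; Input given as a single string and split into lines.
  (is (= {:front ["key1: value1"] :body ["Body"]}
         (process-lines
          (string/split-lines "---\nkey1: value1\n---\n\nBody"))))
  ;; No frontmatter: only a :body key is returned.
  (is (= {:body ["Just a body"]}
         (process-lines ["" "Just a body"])))
  ;; Opening "---" without a closing one: everything lands in :front.
  (is (= {:front ["key1: value1"] :body []}
         (process-lines ["---" "key1: value1"]))))
```

The last case is worth deciding on explicitly; with the function as written, an unterminated frontmatter block swallows the rest of the file.
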
Upvotes: 6
