Parjánya

Reputation: 25

Null char returning from reading a file in Common Lisp

I’m reading files and storing them as a string using this function:

(defun file-to-str (path)
  (with-open-file (stream path) :external-format 'utf-8
          (let ((data (make-string (file-length stream))))
            (read-sequence data stream)
            data)))

If the file has only ASCII characters, I get the content of the file as expected; but if there are characters beyond 127, I get a null character (^@) at the end of the string for each such character. So, after $ echo "~a^?" > ~/teste I get

CL-USER> (file-to-str "~/teste")
"~a^?
"

but after $ echo "aaa§§§" > ~/teste, the REPL gives me

CL-USER> (file-to-str "~/teste")
"aaa§§§
^@^@^@"

and so forth. How can I fix this? I’m using SBCL 1.4.0 in a UTF-8 locale.

Upvotes: 2

Views: 413

Answers (1)

jlahd

Reputation: 6303

First of all, your keyword argument :external-format is misplaced and has no effect. It should be inside the parentheses with stream and path. However, this makes no difference to the end result, since UTF-8 is already the default encoding in your setup.

The problem here is that UTF-8 is a variable-width encoding: ASCII characters each encode to a single byte, but all other characters take 2 to 4 bytes. Here, file-length gives the size of the file in bytes, so you are allocating one string element for every byte of the input file, not one for every character in it. The elements that read-sequence never fills stay unchanged; make-string initialized them as ^@.
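To make the mismatch concrete, here is a small sketch that counts characters and UTF-8 bytes for the same text as in the question. The names utf-8-length and *sample* are made up for this illustration, and it assumes a Unicode-capable implementation such as SBCL.

```lisp
;; Hypothetical helper (not part of any library): number of bytes
;; needed to encode STRING as UTF-8, derived from each code point.
(defun utf-8-length (string)
  (loop for ch across string
        for code = (char-code ch)
        sum (cond ((< code #x80) 1)      ; ASCII: 1 byte
                  ((< code #x800) 2)     ; e.g. U+00A7 (section sign): 2 bytes
                  ((< code #x10000) 3)
                  (t 4))))

;; Same content that echo wrote to the file: "aaa", three section
;; signs (U+00A7), and the trailing newline.
(defparameter *sample*
  (concatenate 'string
               "aaa"
               (make-string 3 :initial-element (code-char #xA7))
               (string #\Newline)))

(list (length *sample*) (utf-8-length *sample*))
;; => (7 10)
```

The file is 10 bytes long but holds only 7 characters, which accounts for the three trailing ^@ in the output above.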

The read-sequence function returns the index of the first element it did not change. You are currently just discarding this information, but you should use it to trim your buffer once you know how many elements were actually filled:

(defun file-to-str (path)
  (with-open-file (stream path :external-format :utf-8)
    (let* ((data (make-string (file-length stream)))
           (used (read-sequence data stream)))
      (subseq data 0 used))))

This is safe, as the length of the file in bytes is always greater than or equal to the number of UTF-8 characters encoded in it. However, it is not terribly efficient: it allocates an unnecessarily large buffer and then copies the whole result into a new string to return it.

While this is fine for a learning experiment, for real-world use cases I recommend the Alexandria utility library that has a ready-made function for this:

* (ql:quickload "alexandria")
To load "alexandria":
  Load 1 ASDF system:
    alexandria
; Loading "alexandria"
* (alexandria:read-file-into-string "~/teste")
"aaa§§§
"
*
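If you would rather not depend on a library, the oversized buffer can also be avoided by reading through a small fixed-size buffer and collecting the pieces in a string stream. This is only a sketch of that idea, not Alexandria's actual implementation; the name file-to-str-chunked and the buffer-size parameter are invented for this example.

```lisp
;; Hypothetical variant of file-to-str: reads BUFFER-SIZE characters at
;; a time instead of allocating one string element per byte of the file.
(defun file-to-str-chunked (path &key (buffer-size 4096))
  (with-open-file (stream path :external-format :utf-8)
    (let ((buffer (make-string buffer-size)))
      (with-output-to-string (out)
        (loop for used = (read-sequence buffer stream)
              ;; Write only the part of the buffer that was filled.
              do (write-string buffer out :end used)
              ;; A short read means end of file was reached.
              while (= used buffer-size))))))
```

It is called the same way as the original, for example (file-to-str-chunked "~/teste"). The result is still assembled into a fresh string at the end, so this trades the file-sized buffer for a fixed-size one rather than eliminating copying altogether.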

Upvotes: 4
