HopedWall

Reputation: 73

With-open-file reading extra characters

I'm trying to read a file into a string (not into a list) in Common Lisp, but I end up with extra characters at the end of the string. This only happens when the file contains characters such as newlines or tabs; spaces seem to work just fine. Here's my code:

(defun load-file (filename)
  (with-open-file (stream filename
                          :direction :input
                          :if-does-not-exist :error)
    (let ((contents (make-string (file-length stream))))
      (read-sequence contents stream)
      contents)))

Please note: unfortunately I'm allowed to use neither loops nor external libraries in this program.

Upvotes: 3

Views: 707

Answers (1)

user5920214


This is an old problem and the answer is 'don't do that'. The reason for this is that file-length can't do what you want it to do in many interesting cases. In particular, a version of file-length which works the way you expect, returning the number of characters in the file, is easy to implement only if one or both of the following are true:

  • the number of characters in the file is some fixed multiple of the number of bytes in the file;
  • the OS you are using records the number of characters in the file for you.

Sadly neither of these things is true for any modern platform I know of:

  • the number of characters in the file is not a fixed multiple of the number of bytes in it for at least two reasons:

    • line-end encodings mean that the file may contain two characters at the end of each line (#\Return #\Linefeed) which will be read as a single #\Newline;
    • files may use encodings which don't map bytes onto characters in any simple way, such as UTF-8, quite apart from line-ending sequences;
  • but the OS tells you only the number of bytes in the file.

For platforms like this, the only way for file-length to tell you what you want to know for a file you are reading as a stream of characters is to read and decode the whole file, and this is clearly undesirable. In practice file-length tells you the byte length of the file only.

So this trick of 'work out the length of the file and slurp it in one big chunk' can't work in general, because the length of the file in characters can't be known without reading it.
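To see the mismatch directly, here is a small sketch that writes a file containing one non-ASCII character and compares what file-length reports with what read-sequence actually reads. The function names and the /tmp path are made up for illustration, and it assumes an implementation that accepts :external-format :utf-8 and has Unicode characters (SBCL and CCL do).

```lisp
(defun write-sample-file (path)
  "Write the four characters c, a, f, e-acute to PATH as UTF-8."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :external-format :utf-8)
    (write-string "caf" out)
    ;; U+00E9 takes two bytes in UTF-8; this assumes CODE-CHAR maps
    ;; codes to Unicode code points, as most implementations do
    (write-char (code-char #xE9) out))
  path)

(defun byte-and-char-lengths (path)
  "Return the FILE-LENGTH of PATH and the number of characters read from it."
  (with-open-file (in path :direction :input :external-format :utf-8)
    (let* ((bytes (file-length in))
           (buffer (make-string bytes))
           ;; READ-SEQUENCE returns the index of the first element of
           ;; BUFFER that it did not fill
           (chars (read-sequence buffer in)))
      (values bytes chars))))

;; Hypothetical path: adjust for your system
(byte-and-char-lengths (write-sample-file "/tmp/file-length-demo.txt"))
```

On such an implementation I would expect the two values to be 5 and 4. The last element of the buffer is never filled and keeps whatever make-string put there, which is where the 'extra characters' in the question come from.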

It is slightly annoying (and I think a mild deficiency of CL) that it doesn't include a function whose contract is 'read this file and return a string which contains it'.

I believe that it is the case, at least for common encodings, that the character length of the file will never be longer than the byte length. So if you are willing to live a bit dangerously, one thing you can do is to allocate an array which is the byte length of the file, read the file, and then note how much of the array you filled (for added cleverness use an adjustable array and adjust it after reading to be the right length).
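A minimal sketch of that idea, using neither LOOP nor libraries: it keeps the load-file from the question but uses the value returned by read-sequence, which is the index of the first element it did not fill. I've used subseq to trim the string rather than an adjustable array, which costs a copy but is simpler. It relies on the assumption above that the file has no more characters than bytes.

```lisp
(defun load-file (filename)
  "Read FILENAME into a string without LOOP or external libraries."
  (with-open-file (stream filename
                          :direction :input
                          :if-does-not-exist :error)
    (let* ((contents (make-string (file-length stream)))
           ;; READ-SEQUENCE returns the index of the first element of
           ;; CONTENTS that was not filled, which here is the number
           ;; of characters read
           (end (read-sequence contents stream)))
      ;; Drop the unfilled tail.  ASSUMPTION: the file has no more
      ;; characters than bytes; anything beyond that is not read.
      (subseq contents 0 end))))
```

The returned string is a fresh copy of exactly the characters that were read, so nothing from the unfilled tail leaks into it.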


Note that Alexandria contains a function, read-file-into-string, which does what you want and is portable and probably fast.


Here is a fairly naive version which I think works in most cases (it doesn't think about the element types of strings at all):

(defun file->string (f &key (buffer-size 1024))
  (with-open-file (in f :direction :input)
    (with-output-to-string (out)
      (loop with buffer = (make-string buffer-size)
            for nchars = (read-sequence buffer in)
            do (write-sequence buffer out :start 0 :end nchars)
            while (= nchars buffer-size)))))
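If LOOP is off limits, as in the question, the same chunked copy can be written with a local recursive function. This is a sketch rather than a tested replacement: file->string/recursive and copy-chunk are names I've made up, and it assumes the implementation optimises tail calls (otherwise the stack depth grows with the file size divided by buffer-size).

```lisp
(defun file->string/recursive (f &key (buffer-size 1024))
  "Like the LOOP version, but iterating with a local recursive function."
  (with-open-file (in f :direction :input)
    (with-output-to-string (out)
      (let ((buffer (make-string buffer-size)))
        (labels ((copy-chunk ()
                   (let ((nchars (read-sequence buffer in)))
                     ;; Copy only the part of BUFFER that was filled
                     (write-sequence buffer out :start 0 :end nchars)
                     ;; A short read means end of file was reached;
                     ;; a full buffer means there may be more to read
                     (when (= nchars buffer-size)
                       (copy-chunk)))))
          (copy-chunk))))))
```

A small buffer-size such as 4 is a cheap way to exercise the multi-chunk path on a short file.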

Here is a partly-tested, much hairier, function which tries to be much more clever, and deals with the case where the file is shorter in bytes than it is in characters (which can occur even on sane platforms if the file is being appended to while it is being read). The branch of the code that deals with this has not been tested: caveat emptor.

This also copies the data less in most cases, but the string it returns will in general have some wasted space in it. It assumes that fill pointers are cheap (they should be) and that resizing an array is acceptable only as a last resort: so when it needs to make the string shorter it does so by setting the fill pointer rather than by resizing it, only resizing it when it needs to make it longer.

It also mildly assumes that tail calls are optimised.

(defun file->string (f &key (element-type ':default)
                       (external-format ':default)
                       (growth-factor 0.1))
  "Read a file into a string, dealing with character encoding issues"
  ;; This attempts to be efficient: it allocates a string which, if
  ;; there are slightly fewer characters than bytes in the file (which
  ;; is the case for common encodings), will be a little too large,
  ;; then reads the file into it in one fell swoop, setting the
  ;; fill-pointer correctly after doing so if needed.  It also
  ;; attempts to deal with the case where the file is *shorter* in
  ;; bytes than it is in characters (this might be true if the file
  ;; was being appended to as the read is happening, or on some
  ;; platform which compresses files and reports the compressed
  ;; length), although this part of the code is untested.
  ;;
  ;; I am not sure if the use of LISTEN here is really right.
  ;;
  (with-open-file (in f :direction :input
                      :element-type element-type
                      :external-format external-format)
    (let* ((l (file-length in))
           (buf (make-array (list l)
                            :element-type (stream-element-type in)
                            :adjustable t :fill-pointer t))
           (n (read-sequence buf in)))
      (cond ((< n l)
             ;; Just make the array seem a bit shorter: this is the
             ;; common case for things like UTF-8 and DOS line endings
             (adjust-array buf (list l) :fill-pointer n))
            ((and (= n l) (not (listen in)))
             ;; We got the exact length of the string and the stream
             ;; is at EOF.  So the string is fine as is: this will be
             ;; true for traditional Unix encodings where a character
             ;; is a byte and line endings are a single character.
             buf)
            (t
             ;; This is unexpected: the file is longer in characters
             ;; than it is in bytes.  This code is UNTESTED since the
             ;; only case I can engineer for it involves a race
             ;; between something which is appending to the file and
             ;; this code, and that test is too hard to set up.
             (labels ((get-more (start chunk-size)
                        (let ((size (+ start chunk-size)))
                          (adjust-array buf (list size) :fill-pointer size)
                          ;; READ-SEQUENCE returns the index of the
                          ;; first element it did not fill, not a
                          ;; count, so compare against SIZE
                          (let ((end (read-sequence buf in :start start)))
                            (cond ((< end size)
                                   ;; we're done: set the fill pointer
                                   ;; right and return
                                   (adjust-array buf (list size)
                                                 :fill-pointer end))
                                  ((and (= end size) (not (listen in)))
                                   ;; We're also done: we got the
                                   ;; exact number of characters we
                                   ;; had allocated fortuitously
                                   buf)
                                  (t
                                   ;; there is more to get
                                   (get-more (+ start chunk-size) chunk-size)))))))
               (get-more l (ceiling (* l growth-factor)))))))))

Upvotes: 9
