Reputation: 2813
I have code that runs without error when executed from the SLIME prompt inside Emacs. If I start sbcl from the shell prompt instead, I get this error:
* (ei:proc-file "BRAvESP000.log" "lixo")
debugger invoked on a SB-INT:STREAM-ENCODING-ERROR:
:UTF-8 stream encoding error on
#<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo"
the character with code 55357 cannot be encoded.
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [OUTPUT-NOTHING ] Skip output of this character.
1: [OUTPUT-REPLACEMENT] Output replacement string.
2: [ABORT ] Exit debugger, returning to top level.
(SB-IMPL::STREAM-ENCODING-ERROR-AND-HANDLE #<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo" {10049E8FF3}> 55357)
The mystery is that in both cases I am using the same SBCL 1.1.8 on the same machine, running Mac OS 10.8.4. Any ideas?
The code:
(defun proc-file (filein fileout &key (fn-convert #'identity))
  (with-open-file (fout fileout
                        :direction :output
                        :if-exists :supersede
                        :external-format :utf8)
    (with-open-file (fin filein :external-format :utf8)
      (loop for line = (read-line fin nil)
            while line
            do (handler-case
                   (let* ((line (ppcre:regex-replace "^.*{jsonTweet=" line "{\"jsonTweet\":"))
                          (data (gethash "jsonTweet" (yason:parse line))))
                     (yason:encode (funcall fn-convert (yason:parse data)) fout)
                     (format fout "~%"))
                 (end-of-file ()
                   (format *standard-output* "Error[~a]: ~a~%" filein line)))))))
Upvotes: 5
Views: 696
Reputation: 1069
This is almost certainly a bug in yason. JSON requires that a non-BMP character, if escaped, be written as a surrogate pair, and a parser should combine that pair back into a single character. yason does not. Here's a simple example with U+10000 (which may optionally be escaped in JSON as "\ud800\udc00"; I use babel below because babel's conversion is less stringent):
(map 'list #'char-code (yason:parse "\"\\ud800\\udc00\""))
=> (55296 56320)
Code point 55296 (U+D800) is the first high surrogate. Surrogates are meaningful only as pairs in UTF-16 and should never appear as characters in a decoded string, which is why writing them to a UTF-8 stream fails. Fortunately, this is easy to work around by using babel to encode the string to UTF-16 and back again:
(babel:octets-to-string (babel:string-to-octets (yason:parse "\"\\ud800\\udc00\"") :encoding :utf-16le) :encoding :utf-16le)
=> "𐀀"
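The pairing rule itself can also be sketched without any library, which may help when checking what a string actually contains. The following is only an illustration, not yason's or babel's code: `high-surrogate-p`, `low-surrogate-p` and `merge-surrogates` are hypothetical helper names, and the sketch assumes an implementation such as SBCL where `code-char` accepts surrogate code points. Note that the code 55357 from the error message is #xD83D, which falls in the high-surrogate range.

```lisp
;; Dependency-free sketch; all three function names are hypothetical.
;; Assumes an implementation (such as SBCL) whose characters cover the
;; full Unicode range and whose CODE-CHAR accepts surrogate code points.

(defun high-surrogate-p (code)
  "True when CODE lies in the high (leading) surrogate range D800-DBFF."
  (<= #xD800 code #xDBFF))

(defun low-surrogate-p (code)
  "True when CODE lies in the low (trailing) surrogate range DC00-DFFF."
  (<= #xDC00 code #xDFFF))

(defun merge-surrogates (string)
  "Return a copy of STRING in which each high/low surrogate pair is
replaced by the single character it encodes. Unpaired surrogates are
copied through unchanged."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length string)))
      (loop while (< i n)
            do (let ((hi (char-code (char string i))))
                 (if (and (high-surrogate-p hi)
                          (< (1+ i) n)
                          (low-surrogate-p (char-code (char string (1+ i)))))
                     (let ((lo (char-code (char string (1+ i)))))
                       ;; Standard UTF-16 decoding arithmetic.
                       (write-char (code-char (+ #x10000
                                                 (ash (- hi #xD800) 10)
                                                 (- lo #xDC00)))
                                   out)
                       (incf i 2))
                     (progn
                       (write-char (char string i) out)
                       (incf i))))))))
```

Applied to the two characters from the example above:

```lisp
(map 'list #'char-code
     (merge-surrogates (coerce (list (code-char 55296) (code-char 56320)) 'string)))
;; => (65536)
```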
You should be able to work around this by changing this line:
(yason:encode (funcall fn-convert (yason:parse data)) fout)
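so that it encodes to an intermediate string, which you convert to UTF-16 and back. The form below returns the repaired string, which still has to be written to fout (for example with write-string):
(babel:octets-to-string
 (babel:string-to-octets
  (with-output-to-string (outs)
    (yason:encode (funcall fn-convert (yason:parse data)) outs))
  :encoding :utf-16le)
 :encoding :utf-16le)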
I submitted a patch that has been accepted to fix this in yason:
Upvotes: 1