levin li

Reputation: 401

How to read a UTF-8 string with usocket

When I was reading from a usocket stream using the code below:

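(let ((stream (socket-stream sk)))
  ;; Return NIL at end of file instead of signalling an error.
  (loop for line = (read-line stream nil)
        ;; Print LINE as data; never pass it as the format control string.
        while line do (format t "~a~%" line)))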

When read-line meets a non-ASCII character, it signals an error:

decoding error on stream
#<SB-SYS:FD-STREAM
  for "socket 118.229.141.195:52946, peer: 119.75.217.109..."
  {BCA02F1}>
(:EXTERNAL-FORMAT :UTF-8):
  the octet sequence (176) cannot be decoded.
   [Condition of type SB-INT:STREAM-DECODING-ERROR]

Neither read-line nor read-byte works, so I tried trivial-utf-8's read-utf-8-string. However, it only accepts a binary stream, and it seems socket-stream does not create one. How can I read from a socket stream that contains non-ASCII characters?

Upvotes: 2

Views: 582

Answers (3)

Matthias Benkard

Reputation: 15759

The error you're getting indicates that the data you're trying to read is not actually valid UTF-8. Indeed, 176 (= #b10110000) has the bit pattern 10xxxxxx of a continuation byte, so it cannot introduce a UTF-8 character. If the data is in some other encoding, try adjusting your Lisp implementation's external format setting accordingly, or use Babel or FLEXI-STREAMS to decode the data.
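To make the bit-pattern argument concrete, here is a minimal sketch of a lead-byte check in plain Common Lisp. The function name utf-8-lead-byte-p is made up for illustration and is not part of any library; the ranges follow the UTF-8 definition (ASCII below #x80, lead bytes #xC2 through #xF4).

```lisp
;; Hypothetical helper, not from any library: tests whether BYTE may
;; start a UTF-8 encoded character.
(defun utf-8-lead-byte-p (byte)
  "Return true if BYTE can be the first byte of a UTF-8 sequence."
  (or (< byte #x80)            ; 0xxxxxxx: a single-byte (ASCII) character
      (<= #xC2 byte #xF4)))    ; lead byte of a 2-, 3- or 4-byte sequence

;; 176 is #xB0, i.e. 10110000: a continuation byte, so this is false.
(utf-8-lead-byte-p 176)
;; #xCE starts a 2-byte sequence (for example CE BB), so this is true.
(utf-8-lead-byte-p #xCE)
```

If the byte really is the first one of a character in your data, the peer is most likely sending a single-byte encoding rather than UTF-8.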

Upvotes: 1

user797257

Reputation:

I needed this once and was too lazy to look for a library, so I wrote it myself :) It may not be the best way, but I only needed something quick and uncomplicated, so here it goes:

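;; Reads one UTF-8 encoded character from a binary stream.
;; +null+ is not defined in this snippet; it is assumed to be a constant
;; defined elsewhere. Note that a 0 byte is treated the same as end of file.
(defun read-utf8-char (stream)
  (loop for i from 7 downto 0
        with first-byte = (read-byte stream nil 0)
        do (when (= first-byte 0) (return +null+))
        ;; The highest clear bit of the lead byte gives the sequence length.
        do (when (or (not (logbitp i first-byte)) (= i 0))
             ;; Keep only the payload bits below bit I.
             (setf first-byte (logand first-byte (- (ash 1 i) 1)))
             (return
               (code-char
                ;; Append 6 payload bits from each continuation byte.
                (dotimes (a (- 6 i) first-byte)
                  (setf first-byte
                        (+ (ash first-byte 6)
                           (logand (read-byte stream) #x3F)))))))))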

Upvotes: 0

Vsevolod Dyomkin

Reputation: 9451

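You can first use read-sequence (if you know the length ahead of time), or call read-byte repeatedly while bytes remain, and then convert the octets to a string with (babel:octets-to-string octets :encoding :utf-8), where octets is (make-array expected-length :element-type '(unsigned-byte 8)). Both approaches need a binary stream, which usocket provides when :element-type '(unsigned-byte 8) is passed to socket-connect.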
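A rough sketch of that approach, assuming usocket and Babel are already loaded (for example through Quicklisp) and that the peer closes the connection when it is done sending. The names read-all-octets, decode-utf-8 and fetch-utf-8 are hypothetical helpers introduced here, not part of either library.

```lisp
;; Assumes (ql:quickload '(:usocket :babel)) has already been run.

(defun read-all-octets (stream)
  "Read bytes from the binary STREAM until end of file."
  (let ((buffer (make-array 0 :element-type '(unsigned-byte 8)
                              :adjustable t :fill-pointer 0)))
    (loop for byte = (read-byte stream nil nil)
          while byte
          do (vector-push-extend byte buffer))
    ;; Hand back a simple octet vector, which is what decoders expect.
    (coerce buffer '(simple-array (unsigned-byte 8) (*)))))

(defun decode-utf-8 (octets)
  "Decode a vector of octets as UTF-8 using Babel."
  (babel:octets-to-string octets :encoding :utf-8))

(defun fetch-utf-8 (host port)
  "Connect to HOST and PORT, read everything, and return it as a string."
  ;; :element-type makes socket-stream return a binary stream.
  (let ((socket (usocket:socket-connect host port
                                        :element-type '(unsigned-byte 8))))
    (unwind-protect
         (decode-utf-8 (read-all-octets (usocket:socket-stream socket)))
      (usocket:socket-close socket))))
```

Decoding only after all bytes have arrived avoids splitting a multi-byte character across two reads. If the data is not valid UTF-8, the decoding step will still fail, so the encoding passed to Babel has to match what the peer actually sends.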

Upvotes: 1
