Sim

Reputation: 4184

How to replace any escape character (e.g. vertical tab) inside a given string

I have to deal with certain files that may contain escape characters like vertical tab (aka "^k"), which messes with the REPL (SBCL) and some libraries (e.g. cxml-stp).

Is there a reference covering those characters in CL, and how can I filter them? I only found an Emacs Lisp reference, but as far as I was able to confirm, it does not really apply to Common Lisp.

Upvotes: 2

Views: 1172

Answers (3)

Michael H.

Reputation: 902

The trouble with using REMOVE-IF is, as you note, that there's not a nice handle on the character you're looking for. Search the HyperSpec for "semi-standard characters", and you'll see the list of named characters is pretty short -- the standard #\Newline and #\Space, plus the semi-standard #\Backspace, #\Tab, #\Linefeed, #\Page, #\Return, and #\Rubout.

If you're using Emacs as your editor, it's not too hard to find the target character, cut it, and paste it into your code as a literal -- but that's not a great idea either. On the other hand, if you can find the character and paste it into your REPL, then you should be able to call CHAR-NAME on it, à la: (char-name (aref "<copy-paste-char>" 0)).
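
If pasting the raw character is awkward, one alternative is to build it from its code. A minimal sketch, assuming an implementation whose character codes are ASCII-compatible (as in SBCL), so that code 11 is the vertical tab; the variable name is made up for illustration, and the name CHAR-NAME returns is implementation-dependent:

```lisp
;; Assumption: char codes are ASCII-compatible, so 11 is vertical tab.
(defparameter *vertical-tab* (code-char 11))

;; CHAR-NAME returns an implementation-dependent name (or NIL).
(format t "code ~D, name ~A~%"
        (char-code *vertical-tab*)
        (char-name *vertical-tab*))

;; Once you have the character object, plain REMOVE is enough.
(defparameter *cleaned*
  (remove *vertical-tab* (format nil "a~Cb" *vertical-tab*)))
```

Here *cleaned* ends up holding the string with the vertical tab dropped.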

I'd do something like what wvxvw is doing, except more exploratory. Write code to walk over your file, collect all characters in use, print their codes and their names. (Don't just print every one; count the number of occurrences using a hashtable, i.e. (incf (gethash <char> ht 0)), so that you can get a sense for how frequently things occur, and you're not overwhelmed with output.) Then you can make a more informed decision on how to identify & eliminate characters you don't want in your file.
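
The exploratory pass described above could look roughly like this. It is only a sketch: the function names are hypothetical, and the file is read with the implementation's default external format, which may need to be set explicitly for your files.

```lisp
;; Count how often each character occurs in the file at PATH.
;; Characters are EQL-comparable, so the default hash table test works.
(defun character-histogram (path)
  (let ((counts (make-hash-table)))
    (with-open-file (in path :direction :input)
      (loop for ch = (read-char in nil nil)
            while ch
            do (incf (gethash ch counts 0))))
    counts))

;; Print code, name, and count for every non-graphic character found,
;; so ordinary letters and digits do not flood the output.
(defun report-odd-characters (path)
  (maphash (lambda (ch n)
             (unless (graphic-char-p ch)
               (format t "~4D  ~12A ~D~%" (char-code ch) (char-name ch) n)))
           (character-histogram path)))
```

Calling (report-odd-characters "some-file.txt") on a suspect file should give a short list of codes to decide about, newline included.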

Upvotes: 1

user797257

Reputation:

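;; Remove control codes (below 32) from a sequence of character codes,
;; keeping carriage return (13) and line feed (10).
;; Note: this expects integer codes, not character objects.
(defun sanitize (codes)
  (remove-if
   (lambda (x)
     (and (< x 32)
          (not (or (= x 13) (= x 10)))))
   codes))

;; Demo: fill a 100-element byte vector with random codes below 64,
;; sanitize it, and build a string from the codes that survive.
(with-output-to-string (s)
  (let ((sanitized
          (sanitize
           (let ((a (make-array 100 :element-type '(unsigned-byte 8))))
             (dotimes (i (length a) a)
               (setf (aref a i) (random 64)))))))
    (dotimes (i (length sanitized))
      (princ (code-char (aref sanitized i)) s))))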
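
The function above compares integer codes, so it fits byte vectors rather than Lisp strings. If the data is already a string, a variant along these lines might be what you want. This is a sketch, not part of the original answer: the name sanitize-string is made up, and it assumes ASCII-compatible character codes (true in SBCL).

```lisp
;; Hypothetical string-based variant: drop characters whose code is
;; below 32, except line feed (10) and carriage return (13).
;; Assumes CHAR-CODE yields ASCII-compatible values.
(defun sanitize-string (string)
  (remove-if (lambda (ch)
               (let ((code (char-code ch)))
                 (and (< code 32)
                      (/= code 10)
                      (/= code 13))))
             string))
```

For example, (sanitize-string (coerce (list #\a (code-char 11) #\b) 'string)) drops the vertical tab and keeps the two letters.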

But it might depend on your source and on exactly which characters you want to keep. This works for ASCII, provided the codes are guaranteed to be in the (mod 128) range. Unicode is a much more complex question. The filter will still remove control characters that might have special meaning in, say, a shell script, but it is not a good way to construct Unicode strings from encodings such as UTF-8: if your source arrives as bytes, you need to decode it first and make sure the bytes form a valid sequence in that encoding. You would also need to handle the alternative (redundant) representations possible in Unicode, such as letters followed by combining diacritics versus precomposed characters, the unassigned gaps in the code point ranges, and so on.

To tell you the truth, I have yet to see a Lisp with a 100% conforming Unicode implementation. It is more difficult than it sounds, and you probably only need a subset of it anyway.

If you want a practical case where this kind of filtering was not a good idea for Unicode strings, google for "directory traversal attacks" and the IIS 5 vulnerability.

Upvotes: 2

Vatine

Reputation: 21258

REMOVE-IF with a suitable test function should do the trick.
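
One candidate for such a test function is the standard predicate GRAPHIC-CHAR-P, which avoids hard-coding numeric codes. A rough sketch, with a made-up function name, that uses REMOVE-IF-NOT to keep graphic characters plus newline and tab (#\Tab is only semi-standard, but SBCL supports it):

```lisp
;; Hypothetical helper: keep printable (graphic) characters, plus
;; newline and tab, and drop every other control character.
(defun strip-control-characters (string)
  (remove-if-not (lambda (ch)
                   (or (graphic-char-p ch)
                       (char= ch #\Newline)
                       (char= ch #\Tab)))
                 string))
```

Which characters count as graphic beyond the standard ones is up to the implementation, so it is worth checking the result against a sample of your real data.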

Upvotes: 1
