Desmond Cheong

Reputation: 836

General method to trim non-printable characters in Clojure

I encountered a bug where I couldn't match two seemingly 'identical' strings together. For example, the following two strings fail to match: "sample" and "​sample".

To replicate the issue, one can run the following in Clojure.

(= "sample" "​sample") ; returns false

After an hour of frustrated debugging, I discovered that there was a zero-width space at the front of the second string! Removing it from this particular example with a backspace is trivial. However, I have a database of strings that I'm matching, and it seems that several of them have this issue. My question is: is there a general method to trim zero-width spaces in Clojure?

Some methods I've tried:

(count (clojure.string/trim "​abc")) ; returns 4
(count (clojure.string/replace "​abc" #"\s" "")) ; returns 4

The thread "Remove zero-width space characters from a JavaScript string" does provide a regular-expression solution that works in this example, i.e.

(count (clojure.string/replace "​abc" #"[\u200B-\u200D\uFEFF]" "")) ; returns 3

However, as stated in the post itself, there are many other Unicode characters that may be invisible. So I'm still interested in whether there's a more general method that doesn't rely on listing every possible invisible Unicode character.

Upvotes: 6

Views: 2024

Answers (2)

Alan Thompson

Reputation: 29966

The regex solution from @Rulle is very nice. The tupelo.chars namespace also has a collection of character classes and predicate functions that could be useful. They work in Clojure and ClojureScript, and also include the non-breaking space (&nbsp;) for browsers. In particular, check out the visible? predicate.

The tupelo.string namespace also has a number of helper & convenience functions for string processing.

(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.chars :as chars]
    [tupelo.string :as str] ))

(def sss
"Some multi-line
string." )

(dotest
  (println "result:")
  (println
    (str/join
      (filterv
        #(or (chars/visible? %) 
             (chars/whitespace? %))
        sss))))

with result

result:
Some multi-line
string.

To use, make your project.clj look like:

  :dependencies [
                 [org.clojure/clojure "1.10.2-alpha1"]
                 [prismatic/schema "1.1.12"]
                 [tupelo "20.07.01"]
                 ]
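For comparison, the same character-by-character filtering idea can be sketched on the JVM without an extra dependency, using java.lang.Character. This is not how tupelo implements its predicates; keep-char? and keep-visible are hypothetical names chosen for illustration, and the rule used here (keep whitespace, drop control and format characters) is an assumption about what counts as "invisible".

```clojure
(require '[clojure.string :as str])

;; Hypothetical predicate: keep whitespace, and keep any character whose
;; Unicode general category is neither control (Cc) nor format (Cf).
(defn keep-char? [c]
  (let [t (Character/getType (int c))]
    (or (Character/isWhitespace (int c))
        (not (or (= t Character/CONTROL)
                 (= t Character/FORMAT))))))

;; Hypothetical helper: rebuild the string from the characters we keep.
(defn keep-visible [s]
  (str/join (filter keep-char? s)))

(keep-visible "\u200BSome multi-line\nstring.")
;; => "Some multi-line\nstring."
```

The zero-width space (U+200B) is in category Cf, so it is dropped, while the line break survives because it counts as whitespace.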

Upvotes: 1

Rulle

Reputation: 4901

I believe what you are referring to are so-called non-printable characters. Based on this answer for Java, you could pass the regular expression #"\p{C}" as the pattern to clojure.string/replace:

(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"\p{C}" ""))

However, this will remove line breaks, e.g. \n. So in order to keep those characters, we need a more complex regular expression:

(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"[\p{C}&&^(\S)]" ""))

This function will remove non-printable characters. Let's test it:

(= "sample" "​sample")
;; => false

(= (remove-non-printable-characters "sample")
   (remove-non-printable-characters "​sample"))
;; => true

(remove-non-printable-characters "sam\nple")
;; => "sam\nple"

The \p{C} pattern is discussed here.
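If the &&^(\S) part of the pattern is hard to read, the same intent (category C, but not whitespace) can also be spelled with a nested negated class. The following is a minimal sketch rather than a drop-in replacement; strip-invisible and code-points are hypothetical names, and code-points is only a debugging aid for seeing which hidden characters a string actually contains.

```clojure
(require '[clojure.string :as str])

;; Hypothetical variant: remove characters in Unicode category C
;; (control, format, unassigned, ...) unless they are whitespace.
(defn strip-invisible [s]
  (str/replace s #"[\p{C}&&[^\s]]" ""))

;; Hypothetical debugging aid: show each char as a U+XXXX code unit.
(defn code-points [s]
  (mapv #(format "U+%04X" (int %)) s))

(code-points "\u200Babc")
;; => ["U+200B" "U+0061" "U+0062" "U+0063"]

(strip-invisible "\u200Bsam\nple")
;; => "sam\nple"

(= "sample" (strip-invisible "\u200Bsample"))
;; => true
```

Writing the zero-width space as the escape \u200B in test strings keeps it visible in the source, which avoids the original confusion when reading the code.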

Upvotes: 4
