Reputation: 836
I encountered a bug where I couldn't match two seemingly 'identical' strings together. For example, the following two strings fail to match: "sample" and "sample".
To replicate the issue, one can run the following in Clojure.
(= "sample" "\u200Bsample") ; returns false
After an hour of frustrated debugging, I discovered that there was a zero-width space at the front of the second string! Removing it from this particular example via a backspace is trivial. However, I have a database of strings that I'm matching, and it seems that multiple strings have this issue. My question is: is there a general method to trim zero-width spaces in Clojure?
Some methods I've tried:
(count (clojure.string/trim "\u200Babc")) ; returns 4
(count (clojure.string/replace "\u200Babc" #"\s" "")) ; returns 4
This thread, "Remove zero-width space characters from a JavaScript string", does provide a solution with regular expressions that works in this example, i.e.
(count (clojure.string/replace "\u200Babc" #"[\u200B-\u200D\uFEFF]" "")) ; returns 3
However, as stated in the post itself, there are many other Unicode characters that may be invisible. So I'm still interested in whether there's a more general method that doesn't rely on listing every possible invisible Unicode character.
Upvotes: 6
Views: 2024
Reputation: 29966
The regex solution from @Rulle is very nice. The tupelo.chars namespace also has a collection of character classes and predicate functions that could be useful. They work in Clojure and ClojureScript, and also include the &nbsp; character for browsers. In particular, check out the visible? predicate.
The tupelo.string namespace also has a number of helper & convenience functions for string processing.
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.chars :as chars]
    [tupelo.string :as str]))
(def sss
"Some multi-line
string." )
(dotest
  (println "result:")
  (println
    (str/join
      (filterv
        #(or (chars/visible? %)
             (chars/whitespace? %))
        sss))))
with result
result:
Some multi-line
string.
To use, make your project.clj look like:
  :dependencies [
    [org.clojure/clojure "1.10.2-alpha1"]
    [prismatic/schema "1.1.12"]
    [tupelo "20.07.01"]
  ]
Upvotes: 1
Reputation: 4901
I believe what you are referring to are so-called non-printable characters. Based on this answer in Java, you could pass the #"\p{C}" regular expression as the pattern to replace:
(defn remove-non-printable-characters [x]
  (clojure.string/replace x #"\p{C}" ""))
However, this will also remove line breaks, e.g. \n. So in order to keep those characters, we need a more complex regular expression:
(defn remove-non-printable-characters [x]
  ;; \p{C} intersected with "not whitespace", so \n, \t, etc. are kept
  (clojure.string/replace x #"[\p{C}&&[^\s]]" ""))
This function will remove non-printable characters. Let's test it:
(= "sample" "\u200Bsample")
;; => false
(= (remove-non-printable-characters "sample")
   (remove-non-printable-characters "\u200Bsample"))
;; => true
(remove-non-printable-characters "sam\nple")
;; => "sam\nple"
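For the database case in the question, one possible approach is to clean every stored string once and then compare the cleaned values. The following is only a sketch: the names strip-invisible, clean-index, and stored are made up for illustration, and the sample data is invented.

```clojure
(require '[clojure.string :as string])

(defn strip-invisible
  "Removes characters in Unicode category C (control, format, ...)
  except those that count as whitespace, so line breaks are kept."
  [s]
  (string/replace s #"[\p{C}&&[^\s]]" ""))

(defn clean-index
  "Groups the original strings by their cleaned form (hypothetical helper)."
  [strings]
  (group-by strip-invisible strings))

;; Invented sample data: two variants of "sample", one with U+200B,
;; plus a string ending in U+FEFF.
(def stored ["\u200Bsample" "sample" "other\uFEFF"])

(clean-index stored)
;; equal to {"sample" ["\u200Bsample" "sample"], "other" ["other\uFEFF"]}
```

Any key with more than one value points at strings that differ only in invisible characters, which may be a convenient way to find the affected rows before rewriting them.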
The \p{C} pattern is discussed here.
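To see why two strings differ in the first place, it can help to list each character with its code point and whether \p{C} matches it. This is a minimal sketch; describe-chars is a hypothetical helper, and it works per UTF-16 char, so it does not treat surrogate pairs specially.

```clojure
(defn describe-chars
  "Returns one map per char of s: its code point in hex, whether the
  regex \\p{C} matches it, and whether Java considers it whitespace."
  [s]
  (mapv (fn [c]
          {:hex         (format "U+%04X" (int c))
           :category-c? (boolean (re-matches #"\p{C}" (str c)))
           :whitespace? (Character/isWhitespace (int c))})
        s))

(describe-chars "\u200Ba\n")
;; => [{:hex "U+200B", :category-c? true,  :whitespace? false}
;;     {:hex "U+0061", :category-c? false, :whitespace? false}
;;     {:hex "U+000A", :category-c? true,  :whitespace? true}]
```

Note that some invisible-looking characters are not in category C at all. For example, the non-breaking space U+00A0 is a space separator, so neither \p{C} nor the default \s matches it, and it would need to be handled separately.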
Upvotes: 4