Reputation:
I have a String "b\u00f4lovar" and I was wondering if it's possible to unescape it without using Commons Lang. It works, but I'm facing a problem in some environments and I would like to minimize it (i.e., it works on my machine but not in production).
StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))
How can I unescape it without the Apache library?
Thanks in advance.
Upvotes: 3
Views: 1582
Reputation: 14224
If you only want to unescape sequences in the format \u0000, then it is simple to do with a single regex replace:
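def unescapeUnicode(str: String): String =
  // Match \u (one or more 'u's) followed by exactly four hex digits.
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      // '\' and '$' are special in replacement strings, so escape them.
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })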
And the result is
scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ
We have to process the characters $ and \ separately, because they are treated as special by the java.util.regex.Matcher.appendReplacement method:
def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)
scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
... 46 elided
scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
... 46 elided
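As an alternative to matching on the two special characters by hand, the replacement text can be passed through scala.util.matching.Regex.quoteReplacement, which escapes both $ and \ for use in a replacement string. The following is only a sketch of that idea; the object name UnescapeSketch is made up for illustration:

```scala
import scala.util.matching.Regex

// Hypothetical wrapper object, not part of the original answer.
object UnescapeSketch {
  // Same pattern as above: \u (one or more 'u's) plus four hex digits.
  private val UnicodeEscape: Regex = """\\u+([0-9a-fA-F]{4})""".r

  def unescapeUnicode(str: String): String =
    UnicodeEscape.replaceAllIn(str, m =>
      // quoteReplacement escapes '$' and '\' so that appendReplacement
      // treats the decoded character literally.
      Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
}
```

With this version, inputs such as "\\u0024" and "\\u005c" should decode to $ and \ without throwing.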
Unicode character escapes are a bit special: they are not part of string literals, but part of the program code itself. A separate phase replaces Unicode escapes with characters:
scala> Integer.toString('a', 16)
res2: String = 61
scala> val \u0061 = "foo"
a: String = foo
scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = " "
The Scala library has a function, StringContext.treatEscapes, that supports all the normal escapes from the language specification.
So if you want to support Unicode escapes and all the normal Scala escapes, you can apply both sequentially:
def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))
scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b
scala> unescape("\\u005ct")
res5: String = " "
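Note that StringContext.treatEscapes throws an exception on input it does not recognize as an escape (for example \q or a trailing backslash). If the strings come from an untrusted source, one option is to wrap the call. This is a minimal sketch; the names SafeUnescapeSketch and safeTreatEscapes are hypothetical, and newer Scala versions may report treatEscapes as deprecated in favor of StringContext.processEscapes:

```scala
import scala.util.Try

// Hypothetical helper object, not part of the Scala library.
object SafeUnescapeSketch {
  // Returns None instead of throwing when the input contains
  // an escape sequence that treatEscapes rejects.
  def safeTreatEscapes(str: String): Option[String] =
    Try(StringContext.treatEscapes(str)).toOption
}
```

The same wrapper could be applied around the combined unescape function above if a failure should not propagate to the caller.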
Upvotes: 3