user866364

Scala - unescape Unicode String without Apache

I have a String "b\u00f4lovar" and I was wondering whether it is possible to unescape it without using Commons Lang. It works, but I'm running into problems in some environments (it works on my machine but not in production), and I would like to minimize them.

StringEscapeUtils.unescapeJava(variables.getOrElse("name", ""))

How can I unescape it without the Apache library?

Thanks in advance.

Upvotes: 3

Views: 1582

Answers (1)

Kolmar

Reputation: 14224

Only Unicode escapes

If you only want to unescape sequences of the form \uXXXX (four hex digits), a single regex replace is enough:

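def unescapeUnicode(str: String): String =
  // u+ also accepts a run of u's after the backslash, which Java source permits
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      // a backslash and a dollar sign are special in replacement strings, so escape them
      case '\\' => """\\"""
      case '$' => """\$"""
      case c => c.toString
    })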

And the result is

scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ

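We have to handle the characters $ and \ separately, because java.util.regex.Matcher.appendReplacement treats them specially in the replacement string ($ starts a group reference, \ escapes the next character). Without that handling the replacement fails: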

def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
  ... 46 elided

scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
  at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
  ... 46 elided
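
As an alternative to matching on $ and \ by hand, the replacement can be passed through java.util.regex.Matcher.quoteReplacement, which escapes both characters. The following is only a sketch of that idea; the object name QuoteReplacementSketch and the function name unescapeUnicodeQuoted are made up for illustration and are not part of the code above.

```scala
import java.util.regex.Matcher

// Hypothetical wrapper object, used only so the sketch compiles on its own
object QuoteReplacementSketch {
  // Same pattern as unescapeUnicode: a backslash, one or more u's, four hex digits
  private val UnicodeEscape = """\\u+([0-9a-fA-F]{4})""".r

  def unescapeUnicodeQuoted(str: String): String =
    UnicodeEscape.replaceAllIn(str, m => {
      val decoded = Integer.parseInt(m.group(1), 16).toChar.toString
      // quoteReplacement escapes any backslash or dollar sign in the decoded text
      Matcher.quoteReplacement(decoded)
    })
}
```

With this variant the two inputs that break wrongUnescape should come back as a plain $ and a plain \ instead of throwing.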

All escape characters

Unicode character escapes are a bit special: they are not part of string literals but part of the program code itself. A separate phase replaces unicode escapes with the corresponding characters:

scala> Integer.toString('a', 16)
res2: String = 61

scala> val \u0061 = "foo"
a: String = foo

scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = "    " 

The Scala library provides the function StringContext.treatEscapes, which supports all the normal escapes from the language specification.

So if you want to support both unicode escapes and all the normal Scala escapes, you can apply the two unescaping steps in sequence:

def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))

scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b

scala> unescape("\\u005ct")
res5: String = "    "
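
To tie this back to the call in the question, here is a rough sketch of how the combined function might replace StringEscapeUtils.unescapeJava. StringContext.treatEscapes throws an IllegalArgumentException subclass on an escape it does not know (for example \q), so the sketch falls back to the unicode-only result in that case. The names UnescapeUsageSketch and unescapeOrKeep and the sample variables map are assumptions for illustration; only unescapeUnicode and treatEscapes come from the answer.

```scala
import scala.util.Try

// Hypothetical wrapper object, used only so the sketch compiles on its own
object UnescapeUsageSketch {
  private val UnicodeEscape = """\\u+([0-9a-fA-F]{4})""".r

  // Same logic as unescapeUnicode above
  def unescapeUnicode(str: String): String =
    UnicodeEscape.replaceAllIn(str, m =>
      Integer.parseInt(m.group(1), 16).toChar match {
        case '\\' => """\\"""
        case '$'  => """\$"""
        case c    => c.toString
      })

  // Unescape unicode first, then the normal escapes; if the second step
  // rejects the input, keep the unicode-only result instead of throwing
  def unescapeOrKeep(str: String): String = {
    val unicodeDone = unescapeUnicode(str)
    Try(StringContext.treatEscapes(unicodeDone)).getOrElse(unicodeDone)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the map used in the question
    val variables = Map("name" -> "b\\u00f4lovar")
    println(unescapeOrKeep(variables.getOrElse("name", "")))
  }
}
```

Whether silently keeping an invalid escape is the right behavior depends on the data; letting the exception propagate, as the plain unescape does, is the stricter choice.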

Upvotes: 3
