Rebronja

Reputation: 355

Removing accents and diacritics in Kotlin

Is there any way to convert a string like 'Dziękuję' to 'Dziekuje' or 'šećer' to 'secer' in Kotlin? I have tried using java.text.Normalizer, but it doesn't seem to work the desired way.

Upvotes: 25

Views: 16529

Answers (5)

Jorgesys

Reputation: 126573

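You only need this extension function. It uses the Normalizer class to decompose the text (NFD) and then removes the combining marks that the decomposition leaves behind:

    import java.text.Normalizer
    fun String.removeAccents() =
        Normalizer.normalize(this, Normalizer.Form.NFD)
            .replace("\\p{Mn}+".toRegex(), "")

Example:

    val words = "Dziękuję šećer aębšc áéíóů canción ñiña Ioana"
    println(words.removeAccents())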

output:

Dziekuje secer aebsc aeiou cancion nina Ioana

Upvotes: 0

David Miguel

Reputation: 14480

TL;DR:

  1. Use Normalizer to canonically decompose the Unicode text.
  2. Remove non-spacing combining characters (\p{Mn}).

import java.text.Normalizer

fun String.removeNonSpacingMarks() =
    Normalizer.normalize(this, Normalizer.Form.NFD)
        .replace("\\p{Mn}+".toRegex(), "")

Long answer:

Using Normalizer you can transform the original text into an equivalent composed or decomposed form.

  • NFD: Canonical decomposition.
  • NFC: Canonical decomposition, followed by canonical composition.

[Figure: Canonical Composites]
(more info about normalization can be found in the Unicode® Standard Annex #15)

In our case, we are interested in the NFD normalization form because it separates every combining character from its base character.

After decomposing the text, we run a regex to remove the characters produced by the decomposition that are combining characters.

Combining characters are special characters intended to be positioned relative to an associated base character. The Unicode Standard distinguishes two types of combining characters: spacing and non-spacing.

We are only interested in non-spacing combining characters. Diacritics are the principal class (but not the only one) of this group used with Latin, Greek, and Cyrillic scripts and their relatives.

To remove non-spacing characters with a regex, we use \p{Mn}. This category includes all 1,826 non-spacing characters.

Other answers use \p{InCombiningDiacriticalMarks}. This block only includes combining diacritical marks; it is a subset of \p{Mn} that contains only 112 characters.
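
The practical difference between the two patterns can be sketched with a mark that lies outside the Combining Diacritical Marks block. The snippet below is a minimal illustration rather than part of the answer itself: the helper names stripMn and stripBlock are made up for this example, and U+20D7 (COMBINING RIGHT ARROW ABOVE) is picked as one non-spacing mark from a different block.

```kotlin
import java.text.Normalizer

// Hypothetical helpers for comparison; the names are not from any library.
val mnRegex = "\\p{Mn}+".toRegex()
val blockRegex = "\\p{InCombiningDiacriticalMarks}+".toRegex()

fun stripMn(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD).replace(mnRegex, "")

fun stripBlock(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD).replace(blockRegex, "")

fun main() {
    // Ordinary Latin accents: both patterns give the same result.
    println(stripMn("šećer") == stripBlock("šećer"))   // true

    // "v" followed by U+20D7, a non-spacing mark outside U+0300..U+036F.
    val vector = "v\u20D7"
    println(stripMn(vector).length)     // 1: the mark is removed
    println(stripBlock(vector).length)  // 2: the mark is kept
}
```

For typical European text the two patterns behave the same; the broader \p{Mn} category only matters when the input may contain marks from other blocks.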

Upvotes: 23

Thiago Silva

Reputation: 796

In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies, I also use .toUpperCase() and .trim(). Then I call this function:

fun stripAccents(s: String): String {
    val sb = StringBuilder(s)

    for (i in s.indices) {
        // These are the accents I needed; to convert others, add new entries here.
        val c = when (s[i]) {
            'Ã', 'Á' -> 'A'
            'Õ', 'Ó' -> 'O'
            'Ç' -> 'C'
            'Ê', 'É' -> 'E'
            'Ú' -> 'U'
            else -> s[i] // leave every other character unchanged
        }
        sb.setCharAt(i, c)
    }

    return sb.toString()
}

To use this function, call it like this:

var str = editText.text.toString() // get the text from the EditText
str = str.toUpperCase().trim()

str = stripAccents(str) // call the function

Upvotes: -4

Eugen Pechanec

Reputation: 38243

Normalizer only does half the work. Here's how you could use it:

import java.text.Normalizer

private val REGEX_UNACCENT = "\\p{InCombiningDiacriticalMarks}+".toRegex()

fun CharSequence.unaccent(): String {
    val temp = Normalizer.normalize(this, Normalizer.Form.NFD)
    return REGEX_UNACCENT.replace(temp, "")
}

assert("áéíóů".unaccent() == "aeiou")

And here's how it works:

We call normalize() with the NFD form. If we pass à, the method returns a followed by the combining grave accent (U+0300). Then, using a regular expression, we strip the combining diacritical marks so that only the base characters remain.

Source: http://www.rgagnon.com/javadetails/java-0456.html

Note that Normalizer is a Java class; this is not pure Kotlin and it will only work on JVM.
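
To see what the decomposition step does on its own, the following sketch prints the code units before and after NFD normalization. It is illustrative only: codeUnits and unaccentSketch are hypothetical names introduced here, and the last line relies on the fact that Ł has no canonical decomposition, so it passes through the regex unchanged.

```kotlin
import java.text.Normalizer

// Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
fun codeUnits(s: String): List<String> = s.map { "U+%04X".format(it.code) }

// Same technique as unaccent() above, repeated so the sketch is self-contained.
fun unaccentSketch(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD)
        .replace("\\p{InCombiningDiacriticalMarks}+".toRegex(), "")

fun main() {
    val composed = "\u00E0" // à as a single precomposed character
    val decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD)

    println(codeUnits(composed))    // [U+00E0]
    println(codeUnits(decomposed))  // [U+0061, U+0300]

    // Ł (U+0141) has no canonical decomposition, so it is left untouched.
    println(unaccentSketch("\u0141\u00F3d\u017A")) // Łodz
}
```

Characters such as Ł that are not built from a base letter plus a combining mark need an explicit replacement if plain ASCII output is required.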

Upvotes: 58

user8959091

Reputation:

This is an extension function you can use and extend further:

fun String.normalize(): String {
    val original = arrayOf("ę", "š")
    val normalized =  arrayOf("e", "s")

    return this.map { it ->
        val index = original.indexOf(it.toString())
        if (index >= 0) normalized[index] else it
    }.joinToString("")
}

Use it like this:

val originalText = "aębšc"
val normalizedText = originalText.normalize()
println(normalizedText)

This will print:

aebsc

Extend the arrays original and normalized with as many elements as you need.
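
A map can replace the two parallel arrays so that each character sits next to its replacement. This is a minimal sketch of that variant, not the code from the answer: the names replacements and replaceMapped are hypothetical, and the ł entry is only an example of an extra mapping.

```kotlin
// Hypothetical lookup table; extend it with as many entries as you need.
val replacements: Map<Char, Char> = mapOf(
    '\u0119' to 'e', // ę
    '\u0161' to 's', // š
    '\u0142' to 'l'  // ł
)

// Replaces every character found in the table and keeps all others.
fun String.replaceMapped(): String =
    map { replacements[it] ?: it }.joinToString("")

fun main() {
    println("a\u0119b\u0161c".replaceMapped()) // aebsc
}
```

Keeping each pair on one line makes it harder for the source and target lists to drift out of sync as the table grows.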

Upvotes: 4
