Sujit Devkar

Reputation: 1215

Find out number of characters in a UTF-8 string in Java/Android

I am trying to find the length of a string that is stored as UTF-8. I tried the following approach:

String str = "मेरा नाम";
Charset UTF8_CHARSET = Charset.forName("UTF-8");
byte[] abc = str.getBytes(UTF8_CHARSET);
int length = abc.length;

This gives me the length of the byte array, not the number of characters in the string.

I found a website that shows both the string length and the byte length of a UTF-8 string: https://mothereff.in/byte-counter. If my string is मेरा नाम, I should get a string length of 8 characters, not 22 bytes.

Could anyone please guide me on this?

Upvotes: 7

Views: 9215

Answers (6)

Joop Eggen

Reputation: 109613

The smallest "length" is the number of Unicode code points, that is, numbered characters, which is also the length in UTF-32.

Correction: as @liudongmiao mentioned, one should probably use:

int length = string.codePointCount(0, string.length());

In Java 8:

int length = (int) string.codePoints().count();

Earlier Java versions:

int length(String s) {
   int n = 0;
   for (int i = 0; i < s.length(); ++n) {
       int cp = s.codePointAt(i);
       i += Character.charCount(cp);
   }
   return n;
}

A Unicode code point can be encoded in UTF-16 as one or two chars.

The same character may carry diacritical marks, and these can be written as separate code points: a base letter followed by zero or more combining marks. To normalize the string so that each such sequence becomes a single composed code point where one exists (the C in NFC stands for composed):

string = java.text.Normalizer.normalize(string, Normalizer.Form.NFC);
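
To make the effect of normalization on the count concrete, here is a minimal sketch. The class name NormalizeDemo and its helper methods are made up for illustration; only java.text.Normalizer and String.codePointCount come from the JDK.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of the original answer.
class NormalizeDemo {
    /** Counts Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Counts code points after composing with NFC. */
    static int composedCodePoints(String s) {
        return codePoints(Normalizer.normalize(s, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible letter, two code points.
        String decomposed = "e\u0301";
        System.out.println(codePoints(decomposed));         // 2
        System.out.println(composedCodePoints(decomposed)); // 1, NFC yields U+00E9
    }
}
```

Normalization only helps where a precomposed code point exists; sequences without one keep their separate marks.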

BTW for database purposes, the UTF-16 length seems more useful:

string.length() // Number of UTF-16 chars, every char two bytes.

(In the example mentioned UTF-32 length == UTF-16 length.)
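
As a rough side-by-side of the three lengths discussed here, the sketch below measures the string from the question. The class and method names are illustrative assumptions, and StandardCharsets requires Java 7 or later. The string is written with Unicode escapes so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class LengthComparison {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The question's string "मेरा नाम", spelled out as escapes.
        String str = "\u092E\u0947\u0930\u093E \u0928\u093E\u092E";
        System.out.println("UTF-8 bytes:  " + utf8Bytes(str));  // 22
        System.out.println("UTF-16 chars: " + str.length());    // 8
        System.out.println("Code points:  " + codePoints(str)); // 8

        // U+1D518 lies outside the BMP, so it needs a surrogate pair in UTF-16.
        String fraktur = "\uD835\uDD18";
        System.out.println(fraktur.length());    // 2
        System.out.println(codePoints(fraktur)); // 1
    }
}
```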


A dump function

A commenter had some unexpected results; this function dumps each code point of a string:

void dump(String s) {
   int n = 0;
   for (int i = 0; i < s.length(); ++n) {
       int cp = s.codePointAt(i);
       int bytes = Character.charCount(cp); // UTF-16 chars used by this code point (1 or 2), not bytes
       i += bytes;
       System.out.printf("[%d] #%dB: U+%X = %s%n",
           n, bytes, cp, Character.getName(cp));
   }
   System.out.printf("Length:%d%n", n);
}

Upvotes: 8

Lê Quốc Anh

Reputation: 41

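String.length() returns the number of UTF-16 chars, which matches the number of characters as long as none of them lies outside the Basic Multilingual Plane. If you want the number of bytes, you can use String.getBytes().length (with no argument this uses the platform default charset, so the byte count below assumes UTF-8).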

For example:

String str = "アンドリューは本当に凄いですだと";

System.out.println(str.length()); // display 16 corresponding to 16 characters
System.out.println(str.getBytes().length); // display 48 corresponding to 48 bytes

Upvotes: 1

Aleksander Wons

Reputation: 3967

Take a look at http://rosettacode.org/wiki/String_length#Grapheme_Length_4:

import java.text.BreakIterator;

public class Grapheme {
  public static void main(String[] args) {
    printLength("møøse");
    printLength("𝔘𝔫𝔦𝔠𝔬𝔡𝔢");
    printLength("J̲o̲s̲é̲");
  }

  public static void printLength(String s) {
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(s);
    int count = 0;
    while (it.next() != BreakIterator.DONE) {
      count++;
    }
    System.out.println("Grapheme length: " + count+ " " + s);
  }
}

Output:

Grapheme length: 5 møøse
Grapheme length: 7 𝔘𝔫𝔦𝔠𝔬𝔡𝔢
Grapheme length: 4 J̲o̲s̲é̲

What you are looking for is not string length but grapheme length. It gives you the number of "visible" characters.

Upvotes: 4

mushfek0001

Reputation: 3935

String.length() actually returns the number of UTF-16 chars in the string (each char is two bytes). That number equals the UTF-8 byte count only when every character is ASCII, that is, has a value of at most 127. If you want to compute the UTF-8 byte length by hand without encoding the string to UTF-8, you can do something like this:

public static int utf8Length(CharSequence sequence) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;
                ++i;
            } else {
                count += 3;
            }
        }
        return count;
    }

Here's the UTF-8 spec.
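
One way to gain confidence in a hand-rolled counter like this is to compare it against the byte array the JDK produces. The following is only a sketch: Utf8LengthCheck and agrees are hypothetical names, the counting logic is repeated from above so the class compiles on its own, and the inputs are assumed to contain well-formed surrogate pairs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check class, not part of the original answer.
class Utf8LengthCheck {
    // Same counting logic as utf8Length above, repeated so this compiles alone.
    static int utf8Length(CharSequence sequence) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;                 // ASCII: 1 byte
            } else if (ch <= 0x7FF) {
                count += 2;              // up to U+07FF: 2 bytes
            } else if (Character.isHighSurrogate(ch)) {
                count += 4;              // surrogate pair: 4 bytes
                ++i;                     // skip the low surrogate
            } else {
                count += 3;              // rest of the BMP: 3 bytes
            }
        }
        return count;
    }

    /** True if the hand count matches the length of the encoded bytes. */
    static boolean agrees(String s) {
        return utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "abc",                                          // 3 bytes
            "\u00F8",                                       // ø, 2 bytes
            "\u092E\u0947\u0930\u093E \u0928\u093E\u092E",  // मेरा नाम, 22 bytes
            "\uD835\uDD18"                                  // U+1D518, 4 bytes
        };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes, agrees: " + agrees(s));
        }
    }
}
```

For strings with an unpaired surrogate the two counts can differ, because the encoder substitutes a replacement byte while the loop still assumes a full pair.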

Upvotes: 2

nipuna-g

Reputation: 6662

Rather than converting password[0] to a byte array you can simply run

password[0].length();

You can also convert the byte array back to a string and then call the length method on it as well.

    byte[] abc = password[0].getBytes(UTF8_CHARSET);
    String s1 = new String(abc, "UTF-8");
    System.out.println(s1.length());

Upvotes: 0

Prashant

Reputation: 2614

Simply save your source file as UTF-8 and do the following:

        String str= "मेरा नाम";
        System.out.println(str.length());

Output: 8

Upvotes: -2
