Akanksha Mehrotra
Akanksha Mehrotra

Reputation: 21

Reverse a utf-8 string in Java

I was trying to write a java code for reversing a utf-8 string in Java. I was asked this in an interview. However I was wondering if I convert the bytes to bits and get the codepoint from bits, how can we do the code then. Probably that is what the interviewer was looking for.

class Ideone
{

    public static void main(String[] args) {
        String s ="Ž®aͻ";char[] ch = new char[s.length()];
        StringBuilder sb = new StringBuilder(s);
        StringBuilder rev = new StringBuilder();

        for (int i=0; i< s.length(); i++) {

            int x = sb.codePointAt(i);
            char[] y = Character.toChars(x);
            rev.append(y);
        }
        System.out.println(rev.reverse());

    }

}

Upvotes: 1

Views: 4185

Answers (5)

dimo414
dimo414

Reputation: 48864

First, all Java Strings are UTF-16 encoded, not UTF-8. This is important for tasks like reversing strings, because the number of bytes a character will take up depends on the encoding. In UTF-8 the number of bytes is variable, whereas with UTF-16 it's always two bytes. A char is 16 bits of data, even if it's just representing ASCII. UTF-8 can encode ASCII in 8 bits, but may take more to represent other characters.

Because a char is 16 bits, most characters (including Ž®aͻ from your example) all fit nicely into individual chars, and there's no issues. However some characters (notably Emoji fall into this category) cannot be represented by a single char, and now we're dealing with surrogate pairs. You have to be very careful with string manipulations when dealing with text that might have surrogate pairs, because most Java APIs (notably almost every method on String) doesn't handle them properly.

For a better example, consider the string "👶👧👩👵💀🤖". Six characters, right? Not according to Java!

String s ="👶👧👩👵💀🤖";
System.out.println("String: " + s);
System.out.println("Length: " + s.length());
System.out.println("Chars:  " + Arrays.toString(s.toCharArray()));
System.out.println("Split:  " + Arrays.asList(s.split("")));

This prints:

String: 👶👧👩👵💀🤖
Length: 12
Chars:  [?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]
Split:  [?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]

Now, some APIs do properly handle surrogate pairs, such as StringBuilder.reverse():

If there are any surrogate pairs included in the sequence, these are treated as single characters for the reverse operation. Thus, the order of the high-low surrogates is never reversed.

Assuming for the sake of the interview that you can't use this method (or, understandably, you can't recall on the spot whether it's safe or not), you can iterate over the code points of a String with String.codePoints(). This allows you to safely reverse the contents:

List<String> chars = s.codePoints()
    .mapToObj(i -> String.valueOf(Character.toChars(i)))
    .collect(Collectors.toList());
Collections.reverse(chars);
System.out.println(chars.stream().collect(Collectors.joining()));

Prints:

🤖💀👵👩👧👶

Upvotes: 6

Piyush
Piyush

Reputation: 1162

This should work.

String string = "Ž®aͻ";
String reverse = new StringBuilder(string).reverse().toString();
System.out.println(reverse);

Upvotes: 2

Tom Blodget
Tom Blodget

Reputation: 20812

First, you would convert whatever is meant by "utf-8 string" to a "utf-16 string" (java.lang.String), perhaps from a byte array of utf-8 code units by

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
decoder.onMalformedInput(CodingErrorAction.REPORT);

ByteBuffer buffer = ByteBuffer.wrap(bytes);
String s = decoder.decode(buffer).toString();
System.out.println(s);

So, you then have a sequence of UTF-16 code units, one or two of which encode a Unicode codepoint, one base codepoint and zero or more "combining" codepoints form a grapheme cluster, which is what most people would call a character when they see it. Presumably you want to reverse the order of grapheme clusters. Fortunately, Java provides various text break iterators, including one for grapheme breaks in a locale.

Locale locale = Locale.ENGLISH;

StringBuilder reversed = new StringBuilder();            
BreakIterator boundary = BreakIterator.getCharacterInstance(locale);
boundary.setText(s);
int end = boundary.last();
for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
    reversed.append(s.substring(start,end));
}
System.out.println(reversed.toString());

There is the story of a qualification exam where the first page says to read the whole exam before beginning and the last page says to write your name and turn it in without writing anything else. So, is the correct answer, there is no such thing as a "utf-8 string" in Java? If not, you'd have to ask what is meant by "utf-8 string."

And, there would still be questions about how to reverse things like ligatures. Should "fl" be reversed as "lf"?

You could also ask later, what type of project does this company do where reversing strings is important?

Upvotes: 4

Andreas
Andreas

Reputation: 159135

As dimo414 mentioned, StringBuilder.reverse() properly handles surrogate pairs:

If there are any surrogate pairs included in the sequence, these are treated as single characters for the reverse operation. Thus, the order of the high-low surrogates is never reversed.

That means that answer by Piyush is good, except it uses StringBuffer, which you shouldn't use.

If you insist on reversing a String (which is UTF-16, UTF-8) yourself, then you can do it like this code, which iterates the characters backwards, and handles surrogates (or drop the if statement if you don't care about surrogates):

private static String reverse(String input) {
    StringBuilder buf = new StringBuilder();
    for (int i = input.length() - 1; i >= 0; i--) {
        char c = input.charAt(i);
        if (i > 0 && Character.isSurrogate(c)) {
            char c2 = input.charAt(i - 1);
            if (Character.isSurrogate(c2)) {
                buf.append(c2);
                i--;
            }
        }
        buf.append(c);
    }
    return buf.toString();
}

However, your question said "utf-8 string", and UTF-8 is a byte encoding for strings, so if that's what you want, you first need to obtain the UTF-8 bytes, then reverse those bytes, and finally convert to back to a String:

private static String reverse(String input) {
    byte[] utf8bytes = input.getBytes(StandardCharsets.UTF_8);
    utf8bytes = reverseUtf8(utf8bytes);
    return new String(utf8bytes, StandardCharsets.UTF_8);
}

To reverse UTF-8, you need process backwards, and to do that, you need to understand how the encoding works.

Unicode characters in range 0 to 127 are encoded as one byte (i.e. bits 0xxxxxxx). All other Unicode characters are encoded as a block of bytes, starting with a 11xxxxxx byte, and the rest are 10xxxxxx bytes, so we can detect such blocks of bytes and retain them.

private static byte[] reverseUtf8(byte[] input) {
    byte[] reversed = new byte[input.length];
    for (int i = input.length - 1, j = 0; i >= 0; i--) {
        byte b = input[i];
        if ((b & 0x80) == 0) {
            reversed[j++] = b;
        } else {
            int k = i;
            while (k > 0 && (input[k] & 0xC0) == 0x80)
                k--;
            System.arraycopy(input, k, reversed, j, i - k + 1);
            j += i - k + 1;
            i = k;
        }
    }
    return reversed;
}

Upvotes: 4

Trash Can
Trash Can

Reputation: 6824

Not sure if this is acceptable, but you should take advantage of Java 8 feature to reverse the string like this

List<String> chars = Arrays.asList(oldStr.split(""));
Collections.reverse(chars);
String newStr = chars.stream()
                     .collect(Collectors.joining(""));

Upvotes: 0

Related Questions