leoismyname

Reputation: 407

Wrote UTF-16 characters to a file using a UTF-8 charset output stream in Java, but the resulting data in the file still appears to be UTF-16. Why?

Created a simple Java program to see if the UTF-8 charset can save UTF-16 characters, and it is able to save them. Why? If UTF-8 can save UTF-16 characters, then what is the difference between using UTF-16 and UTF-8?

Both test characters' Unicode values are beyond what I thought was the UTF-8 range, i.e. 256.

✈ unicode value: 9992
❄ unicode value: 10052
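
These values can be double-checked with String.codePointAt (a quick snippet using the standard java.lang API):

System.out.println("✈".codePointAt(0)); // prints 9992 (U+2708)
System.out.println("❄".codePointAt(0)); // prints 10052 (U+2744)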

Please see the sample program:

import java.io.*;
import java.nio.charset.Charset;

public class UTFSizeTest {

    public static void main(String[] args) throws IOException {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        write("UTF-16");
        write("UTF-8");
        write(null);
    }

    private static void write(String utf) throws IOException {
        final String fileName = "someFile" + utf;

        Writer writer;

        if (utf == null) {
            // null means fall back to the platform's default charset
            writer = new OutputStreamWriter(new FileOutputStream(fileName));
        } else {
            writer = new OutputStreamWriter(new FileOutputStream(fileName), utf);
        }

        for (int i = 0; i < 2; i++) {
            writer.write("✈ ❄");
            writer.write("\n");
        }

        writer.close();

        System.out.println(fileName + " size: "+ new File(fileName).length());
    }
}

The data written is the same in both files, using UTF-16 and UTF-8:
✈ ❄
✈ ❄

The size of the files is also almost the same for UTF-16 and UTF-8, as can be seen in the console output:
Default Charset=UTF-8
someFileUTF-16 size: 18
someFileUTF-8 size: 16
someFilenull size: 16

If UTF-8 can save 16-bit Unicode just fine, then why use UTF-16 in Java?

Thank you.

Upvotes: 1

Views: 5045

Answers (2)

Remy Lebeau

Reputation: 596632

Created a simple Java program to see if the UTF-8 charset can save UTF-16 characters

It can. UTF-8 and UTF-16 are just different encodings for the same Unicode character set. Both encodings are designed to support all Unicode codepoints, both present ones and those of the foreseeable future.

and it is able to save them. Why?

Because they both support the same Unicode codepoints. Converting between the various UTFs is a lossless operation, by design.
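
For example, a round trip through UTF-8 reproduces the original string exactly (a minimal sketch using java.nio.charset.StandardCharsets; the class name is just for illustration):

import java.nio.charset.StandardCharsets;

public class RoundTripTest {
    public static void main(String[] args) {
        String original = "✈ ❄";
        // Encode to UTF-8 bytes, then decode back: nothing is lost.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String restored = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(original.equals(restored)); // true
    }
}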

If UTF-8 can save UTF-16 characters, then what is the difference between using UTF-16 and UTF-8?

UTF-8 is primarily preferred over UTF-16 because:

  1. UTF-8 is backwards compatible with 7-bit ASCII, so a lot of legacy code can be migrated to UTF-8 without breaking (see the sketch after this list).

  2. For most languages, particularly Latin-based ones, UTF-8 is more compact than UTF-16, thus saving memory, disk space, and bandwidth. However, there are cases, primarily Asian languages, but also symbols (as in your example), where UTF-16 is actually more compact than UTF-8.
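
The ASCII compatibility in point 1 can be demonstrated directly: ASCII text produces byte-for-byte identical output in both encodings (a small sketch; the class name is just illustrative):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatTest {
    public static void main(String[] args) {
        String text = "plain ASCII text";
        // For codepoints below 128, UTF-8 output is identical to ASCII output.
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(ascii, utf8)); // true
    }
}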

Please see the sample program:
...
The data written is the same in both files, using UTF-16 and UTF-8:

Yes, they represent the same Unicode codepoints, so they are rendered the same by a Unicode-aware text viewer/editor. But their physical bytes are very different:

✈
UTF-8:    e2 9c 88
UTF-16LE: 08 27
UTF-16BE: 27 08

❄
UTF-8:    e2 9d 84
UTF-16LE: 44 27
UTF-16BE: 27 44
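
You can reproduce these byte sequences yourself (a minimal sketch; UTF-16LE/BE are used here to avoid the byte order mark that the plain "UTF-16" charset prepends):

import java.nio.charset.StandardCharsets;

public class ByteDump {
    public static void main(String[] args) {
        for (String s : new String[] { "✈", "❄" }) {
            System.out.println(s);
            dump("UTF-8:   ", s.getBytes(StandardCharsets.UTF_8));
            dump("UTF-16LE:", s.getBytes(StandardCharsets.UTF_16LE));
            dump("UTF-16BE:", s.getBytes(StandardCharsets.UTF_16BE));
        }
    }

    // Print each byte of the encoded string as two hex digits.
    private static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label);
        for (byte b : bytes) {
            sb.append(String.format(" %02x", b & 0xff));
        }
        System.out.println(sb);
    }
}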

The size of the files is also almost the same for UTF-16 and UTF-8, as can be seen in the console output.

In the above example, you have chosen 2 Unicode codepoints that do not require UTF-16 surrogate pairs to encode them, so they use 2 bytes instead of 4 bytes in UTF-16. In UTF-8, they take 3 bytes each, but the size difference is reduced by the 1-byte U+0020 SPACE character in between them. Try writing longer strings with a bigger mix of low and high codepoint values, and you should see a much wider variation in file sizes.
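
For instance, comparing encoded lengths directly (a sketch; UTF-16BE is used so the byte order mark does not skew the counts):

import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) {
        print("hello world"); // pure ASCII: UTF-8 is half the size
        print("✈❄✈❄✈❄");      // BMP symbols: UTF-16 is more compact
        print("😀😀😀");       // supplementary codepoints: 4 bytes in both
    }

    private static void print(String s) {
        System.out.println(s
                + " -> UTF-8: " + s.getBytes(StandardCharsets.UTF_8).length
                + " bytes, UTF-16: " + s.getBytes(StandardCharsets.UTF_16BE).length
                + " bytes");
    }
}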

If UTF-8 can save 16-bit Unicode just fine, then why use UTF-16 in Java?

Although UTF-8 and UTF-16 are both variable-length encodings, UTF-16 tends to have less-variable lengths than UTF-8. All of the codepoints within UTF-8's 1-, 2-, and 3-byte formats fit within UTF-16's 2-byte format, making UTF-16 closer to fixed-length than UTF-8. That also means UTF-16 is easier to seek forwards and (particularly) backwards within: you only have to jump 2 or 4 bytes per codepoint, whereas with UTF-8 you have to jump 1, 2, 3, or 4 bytes per codepoint. Decoding logic is thus a bit more complex in UTF-8 than in UTF-16.
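
In Java terms, stepping through a String (which is UTF-16 internally) only needs Character.charCount, while a UTF-8 decoder must inspect each lead byte (a sketch; utf8SequenceLength is a hypothetical helper, not a library method, shown for comparison):

public class CodepointStepping {
    public static void main(String[] args) {
        String s = "a✈😀"; // 1-char, 1-char, and 2-char (surrogate pair) codepoints
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X at char index %d%n", cp, i);
            i += Character.charCount(cp); // jump 1 (BMP) or 2 (surrogate pair) chars
        }
    }

    // In UTF-8, the lead byte alone determines the sequence length.
    static int utf8SequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;  // 0xxxxxxx: ASCII
        if (b < 0xC0) return -1; // 10xxxxxx: a continuation byte, not a lead
        if (b < 0xE0) return 2;  // 110xxxxx
        if (b < 0xF0) return 3;  // 1110xxxx
        return 4;                // 11110xxx
    }
}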

Keep in mind that when Java, Windows, etc. adopted Unicode, it was before UTF-16 existed, when all available codepoints at the time easily fit within UCS-2, which is a fixed-length encoding. It wasn't until later on that Unicode outgrew UCS-2 and UTF-16 was invented to replace it. By then, it was too late to rewrite code that had been migrated to Unicode, so UTF-16 had to maintain backwards compatibility with UCS-2. Besides, much Unicode data used in the real world still tends to fit within UCS-2; only higher codepoints really require the extra bytes used to encode UTF-16 surrogates.
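
That UCS-2 legacy is still visible in Java today: String.length() counts UTF-16 code units, not codepoints (a short snippet):

String smiley = "😀"; // U+1F600, outside the BMP
System.out.println(smiley.length());                           // 2 (one surrogate pair)
System.out.println(smiley.codePointCount(0, smiley.length())); // 1 codepoint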

So, that usually makes UTF-16 a more suitable choice for processing data. It is a better compromise between memory usage and processing overhead than UTF-8, at least when dealing with non-ASCII characters. But UTF-8 is backwards compatible with ASCII, and it tends to be a more suitable format for storing and exchanging data.

Upvotes: 4

leoismyname

Reputation: 407

I asked this question because of my ignorance. I thought UTF-8 could only save codepoints up to 8 bits, and that UTF-16 was required for Unicode characters, i.e. characters represented by 2 bytes or 16 bits.

But after reading through some forums, I realized that UTF-8, UTF-16 and UTF-32 are all different encodings of the same Unicode character set, and that UTF-8 can in fact represent a single codepoint with up to 4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to 4).
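
For example, a supplementary codepoint such as U+1F600 encodes to 4 bytes in UTF-8 (a one-line check):

System.out.println("😀".getBytes(java.nio.charset.StandardCharsets.UTF_8).length); // 4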

Thanks.

Upvotes: 0
