Mehdi mehrad

Reputation: 11

Convert a file from UTF-8 to UTF-16 in Java

I'm trying to convert a file from UTF-8 to UTF-16 with a Java application.

But my output turned out to be like this 蓘Ꟙ괠��Ꟙ돘ꨊ੕䥎潴楦楣慴楯渮瑩瑬攮佲摥牁摤敤乯瑩晩捡瑩潮偬畧楮㷘께뇛賘꼠���藙蘊啉乯瑩晩捡瑩潮⹬慢敬⹏牤敲䅤摥摎潴楦楣慴楯湐汵杩渽��藘귘뗙裙萠��藘꿛賘뇛賘ꨠ

In the end, the output should contain the same text, for example: UTF-8 = سلام, UTF-16 = \u0633\u0644\u0627\u0645

import java.io.*;

class WriteUTF8Data<inbytes> {
    WriteUTF8Data() throws UnsupportedEncodingException {
    }

    public static void main(String[] args) throws IOException {
        System.setProperty("file.encoding","UTF-8");

        byte[] inbytes = new byte[1024];

        FileInputStream fis = new FileInputStream("/home/mehrad/Desktop/PerkStoreNotification(1).properties");
        fis.read(inbytes);
        FileOutputStream fos = new FileOutputStream("/home/mehrad/Desktop/PerkStoreNotification(2).properties");
        String in = new String(inbytes, "UTF16");
        fos.write(in.getBytes());
    }
}

Upvotes: 1

Views: 243

Answers (2)

Jon Skeet

Reputation: 1503469

You're currently converting from UTF-16 into whatever your system default encoding is. If you want to convert from UTF-8, you need to specify that when you're converting the binary data. There are other issues with your code though: you're assuming that InputStream.read fills the whole buffer, and that the buffer holds everything that's in the file. You'd probably be better off using a Reader and a Writer, looping round and reading into a char array, then writing the relevant part of that char array to the writer.

Here's some sample code that does that. It may well not be the best way of doing it these days, but it should at least work:

import java.io.*;
import java.nio.charset.*;
import java.nio.file.*;

public class ConvertUtf8ToUtf16 {

    public static void main(String[] args) throws IOException {
        Path inputPath = Paths.get(args[0]);
        Path outputPath = Paths.get(args[1]);

        char[] buffer = new char[4096];
        // UTF-8 is actually the default for Files.newBufferedReader,
        // but let's be explicit.
        try (Reader reader = Files.newBufferedReader(inputPath, StandardCharsets.UTF_8)) {
            try (Writer writer = Files.newBufferedWriter(outputPath, StandardCharsets.UTF_16)) {
                int charsRead;

                while ((charsRead = reader.read(buffer)) != -1) {
                    writer.write(buffer, 0, charsRead);
                }
            }
        }
    }
}

Upvotes: 3

Michael Gantman

Reputation: 7808

First of all, the answer by Jon Skeet is correct and will work. The problem with your code is that you read bytes that were written as UTF-8, build a String from them as if they were UTF-16, and then write that String out in your platform default encoding (I guess UTF-8). That is why you get garbled output.

A Java String is a sequence of UTF-16 code units, regardless of the encoding of the bytes it was created from. So once you have a String, you can tell Java to produce bytes from it in whatever charset you want. For the same valid String, getBytes("UTF-8") and getBytes("UTF-16") produce different byte sequences.

So if you read your original content and you know that it is UTF-8, you need to create the String as UTF-8:

String inString = new String(inbytes, "UTF-8");

and then, when you are writing, produce the byte array from the String as UTF-16:

fos.write(inString.getBytes("UTF-16"));
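The decode-then-encode step described here can be sketched on an in-memory byte array, which makes the difference between the two encodings visible. This is a minimal sketch, not the asker's program: the class name `Utf8ToUtf16BytesSketch` and the helpers `utf8ToUtf16` and `hex` are made up for illustration, and the sample text is written with `\u` escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8ToUtf16BytesSketch {

    // Decode the bytes as UTF-8, then re-encode the resulting String as UTF-16.
    // Java's "UTF-16" charset writes big-endian output preceded by a byte-order mark.
    static byte[] utf8ToUtf16(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_16);
    }

    // Hypothetical helper: format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The word from the question, spelled with escapes.
        String sample = "\u0633\u0644\u0627\u0645";

        byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = utf8ToUtf16(utf8);

        System.out.println(hex(utf8));   // d8 b3 d9 84 d8 a7 d9 85
        System.out.println(hex(utf16));  // fe ff 06 33 06 44 06 27 06 45
    }
}
```

The same String yields 8 bytes in UTF-8 and 10 bytes in UTF-16 (a 2-byte byte-order mark plus 2 bytes per character), which is why interpreting one as the other produces garbage.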

I would also suggest this tool, which can help you understand how Strings work internally. It is a utility that converts any String into a sequence of Unicode escapes and back.

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library that contains this utility is called MgntUtils and can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc, including the Javadoc for the class StringUnicodeEncoderDecoder. There is also an article that describes the MgntUtils open source library: "Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison".
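For comparison, the escaped form shown in the output above can also be produced with the plain JDK. The following is a minimal sketch and not the MgntUtils implementation: the class name `UnicodeEscapeSketch` and the method `toUnicodeEscapes` are hypothetical, and it simply escapes every UTF-16 code unit of the input, so characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair).

```java
public class UnicodeEscapeSketch {

    // Turn every UTF-16 code unit of the input into a backslash-u escape
    // followed by four lowercase hex digits.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the same escape sequence as the library example for this input.
        System.out.println(toUnicodeEscapes("Hello World"));
    }
}
```

Applied to the word from the question, this sketch gives \u0633\u0644\u0627\u0645, the form the asker listed as the expected result.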

Upvotes: 0
