Scott M

Reputation: 1

Java Code To Identify A String With UTF-8 Data

I am trying to standardize a set of data. Some of the names were UTF-8 encoded, others were not. What I need to do in Java is detect whether a name was UTF-8 encoded, using some form of conditional logic, so I can translate each row correctly.

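// Each row holds one of these two forms of the same name:
String s1 = "José Flores";  // UTF-8 bytes that were decoded as ISO-8859-1
String s1 = "José Flores";   // decoded correctly

String s2;
if ([condition]) {  // placeholder: true when s1 is mis-decoded UTF-8
    byte[] utf8Bytes = s1.getBytes(StandardCharsets.ISO_8859_1);
    s2 = new String(utf8Bytes, StandardCharsets.UTF_8);
} else {
    s2 = s1;
}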

Upvotes: 0

Views: 413

Answers (1)

parlad

Reputation: 1163

With the help of juniversalchardet, you can detect the encoding and then branch on the result. The following method guesses the encoding of a byte array:

public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot decide
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed all bytes to the detector, then signal the end of the data
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

This requires juniversalchardet-1.0.3.jar. Also, here is some info.
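If the values are already Java Strings (as in the question) rather than raw bytes, a different technique that needs no library may also work: re-encode the string as ISO-8859-1 and check whether the resulting bytes form well-formed UTF-8. The sketch below is only a heuristic under that assumption. The class and method names (Utf8Repair, repairIfMisdecoded) are made up for illustration, and it assumes the bad rows were decoded as ISO-8859-1 specifically. Rows that were mis-decoded as windows-1252 may contain characters outside ISO-8859-1 and would be left unchanged, and a correct Latin-1 string that happens to look like valid UTF-8 would be converted by mistake.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet or the JDK.
final class Utf8Repair {

    /**
     * Returns the repaired string if s looks like UTF-8 bytes that were
     * decoded as ISO-8859-1; otherwise returns s unchanged.
     */
    static String repairIfMisdecoded(String s) {
        // Pure ASCII is identical in both encodings, so nothing to repair.
        boolean hasNonAscii = s.chars().anyMatch(c -> c > 0x7F);
        if (!hasNonAscii) {
            return s;
        }
        try {
            // Recover the original bytes; this throws if s contains a
            // character that does not exist in ISO-8859-1.
            ByteBuffer raw = StandardCharsets.ISO_8859_1.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Strict UTF-8 decode; this throws on any malformed sequence.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not mis-decoded UTF-8 (as far as this check can tell).
            return s;
        }
    }
}
```

In the question's pseudocode this would replace the whole if/else: String s2 = Utf8Repair.repairIfMisdecoded(s1);. A correctly decoded "José Flores" stays as it is, because a lone 0xE9 byte followed by a space is not valid UTF-8.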

Upvotes: 1
