parxier

Reputation: 3871

How to convert arbitrary string to Java identifier?

I need to convert any arbitrary string:

to a valid Java identifier:

Is there an existing tool for this task?

With so many Java source refactoring and generating frameworks, one would think this should be quite a common task.

Upvotes: 12

Views: 7837

Answers (4)

Bohemian

Reputation: 425198

This simple method will convert any input string into a valid Java identifier:

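import java.nio.charset.StandardCharsets;

public static String getIdentifier(String str) {
    StringBuilder sb = new StringBuilder("_");
    // Write each UTF-8 byte as an unsigned decimal (0..255); printing signed bytes
    // and replacing non-digits would drop the '-' and let distinct inputs collide
    for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
        sb.append(b & 0xFF).append('_');
    }
    return sb.toString();
}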

Example:

"%^&*\n()" --> "_37_94_38_42_10_40_41_"

Any input characters whatsoever will work - foreign language chars, linefeeds, anything!
In addition, this algorithm is:

  • reproducible
  • unique - i.e. two inputs produce the same result if and only if str1.equals(str2)
  • reversible

Thanks to Joachim Sauer for the UTF-8 suggestion
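
The "reversible" property can be sketched as a decode step. This is only an illustration: the class and method names are hypothetical, and it assumes each byte was written as an unsigned decimal (0..255). If the bytes were printed signed (as Arrays.toString does for a byte[]) and every non-digit run replaced by '_', the minus signs would be lost, so "é" (bytes -61, -87) and "=W" (bytes 61, 87) would both become _61_87_ and could not be told apart.

```java
import java.nio.charset.StandardCharsets;

public final class IdentifierDecoder {
    // Hypothetical inverse of the byte-based encoding; assumes unsigned decimals (0..255)
    public static String decode(String identifier) {
        // split() drops trailing empty strings; parts[0] is the empty string before the leading '_'
        String[] parts = identifier.split("_");
        byte[] bytes = new byte[Math.max(0, parts.length - 1)];
        for (int i = 1; i < parts.length; i++) {
            bytes[i - 1] = (byte) Integer.parseInt(parts[i]);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("_37_94_38_42_10_40_41_").equals("%^&*\n()")); // true
        System.out.println(decode("_195_169_").equals("\u00e9"));                // true
    }
}
```

The round trip only holds for well-formed strings; an unpaired surrogate is replaced during UTF-8 encoding and cannot be recovered.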


If collisions are OK (that is, two different input strings may produce the same result), this code produces more readable output:

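public static String getIdentifier(String str) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        boolean valid = (i == 0) ? Character.isJavaIdentifierStart(c)
                                 : Character.isJavaIdentifierPart(c);
        if (valid) {
            sb.append(c);
        } else {
            // Prefix '_' at position 0 so the result never starts with a digit
            if (i == 0) sb.append('_');
            sb.append((int) c);
        }
    }
    return sb.toString();
}

It preserves characters that are valid in an identifier and converts only the invalid ones to their decimal code values. A leading invalid character also gets a '_' prefix, because the decimal digits alone would not be a legal start of an identifier. Collisions are easy to hit: "a b" and "a32b" both map to a32b.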

Upvotes: 12

Stephen C

Reputation: 719229

"With so many Java source refactoring and generating frameworks, one would think this should be quite a common task."

Actually it is not.

  • A code refactoring framework will start with existing valid Java identifiers, and will be able to generate a new identifier by concatenating them with some additional characters for disambiguation purposes.

  • A typical code generation framework will start out with "names" taken from a restricted character set. It won't have to deal with arbitrary characters.


I presume that the aim of your converter is to produce identifiers that resemble the input strings where possible. If that's the case, I would do the conversion by mapping all legal identifier characters as-is, and replacing each illegal character with "$xxxx", where "xxxx" is the four-digit hex encoding of the 16-bit Java char.

Your scheme works too, but replacing all illegal characters with '_' is more likely to result in identifier collisions; i.e. where two input strings map to the same identifier.

This is straightforward to code, so I'll leave it for you to do.
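
For illustration, a minimal sketch of the "$xxxx" scheme described above. The class and method names are hypothetical. One assumption goes beyond the description: a literal '$' in the input is escaped as well, so that it cannot be mistaken for the start of an escape sequence.

```java
public final class HexEscapeIdentifier {
    // Keeps legal identifier characters and replaces the rest with "$xxxx" (4 hex digits)
    public static String toIdentifier(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean legal = (i == 0)
                    ? Character.isJavaIdentifierStart(c)
                    : Character.isJavaIdentifierPart(c);
            // Assumption: '$' is escaped too, so every '$' in the output starts an escape
            if (legal && c != '$') {
                sb.append(c);
            } else {
                sb.append('$').append(String.format("%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("a b"));  // a$0020b
        System.out.println(toIdentifier("1x"));   // $0031x
    }
}
```

Because '$' is itself a legal identifier start, an escaped first character still yields a valid identifier. An empty input still produces an empty string, which the caller would have to handle separately.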

Upvotes: 0

Steven Schlansker

Reputation: 38526

If you are doing this for autogenerated code (i.e. don't care much about readability) one of my favorites is just to Base64 it. No need to play language lawyer over what characters are valid in what encodings, and it's a pretty common way of "protecting" arbitrary byte data.
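
A rough sketch of how that could look. Standard Base64 output can contain '+', '/' and '=', and may start with a digit, so none of it is usable as an identifier as-is. This sketch therefore assumes the URL-safe alphabet without padding, swaps the remaining illegal '-' for '$', and adds a prefix. The class name and the "b64_" prefix are hypothetical choices.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class Base64Identifier {
    // Hypothetical prefix: keeps the result non-empty and away from a leading digit
    private static final String PREFIX = "b64_";

    public static String encode(String input) {
        String b64 = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(input.getBytes(StandardCharsets.UTF_8));
        // URL-safe alphabet is A-Z a-z 0-9 '-' '_'; '-' is not legal in an identifier
        return PREFIX + b64.replace('-', '$');
    }

    public static String decode(String identifier) {
        String b64 = identifier.substring(PREFIX.length()).replace('$', '-');
        return new String(Base64.getUrlDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = encode("a b");
        System.out.println(id);          // b64_YSBi
        System.out.println(decode(id));  // a b
    }
}
```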

Upvotes: 2

stacker

Reputation: 68972

I don't know of a tool for that purpose, but one can easily be created using the Character class.

Did you know that string€with_special_characters___ is a legal Java identifier?

import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class Conv {
    public static void main(String[] args) {
        String[] idents = { "string with spaces", "100stringsstartswithnumber",
                "string€with%special†characters/\\!", "" };
        for (String ident : idents) {
            System.out.println(convert(ident));
        }
    }

    private static String convert(String ident) {
        if (ident.length() == 0) {
            return "_";
        }
        CharacterIterator ci = new StringCharacterIterator(ident);
        StringBuilder sb = new StringBuilder();
        for (char c = ci.first(); c != CharacterIterator.DONE; c = ci.next()) {
            if (c == ' ')
                c = '_';
            if (sb.length() == 0) {
                if (Character.isJavaIdentifierStart(c)) {
                    sb.append(c);
                    continue;
                } else
                    sb.append('_');
            }
            if (Character.isJavaIdentifierPart(c)) {
                sb.append(c);
            } else {
                sb.append('_');
            }
        }
        return sb.toString();
    }
}

Prints

string_with_spaces
_100stringsstartswithnumber
string€with_special_characters___
_
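
One gap a character-by-character conversion leaves open is reserved words: an input such as "class" consists only of legal characters, so it passes through unchanged, yet it cannot be used as an identifier. The same applies to a lone "_", which has been a keyword since Java 9. A possible post-processing step, sketched with javax.lang.model.SourceVersion from the JDK (the class name, method name and the trailing-underscore convention are assumptions for illustration):

```java
import javax.lang.model.SourceVersion;

public final class KeywordGuard {
    // Hypothetical helper: appends '_' when the candidate is a keyword or a literal such as true/null
    public static String avoidKeyword(String candidate) {
        return SourceVersion.isKeyword(candidate) ? candidate + "_" : candidate;
    }

    public static void main(String[] args) {
        System.out.println(avoidKeyword("class"));               // class_
        System.out.println(avoidKeyword("string_with_spaces"));  // string_with_spaces
        System.out.println(SourceVersion.isName("class_"));      // true
    }
}
```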

Upvotes: 3
