Reputation: 3871
I need to convert any arbitrary string to a valid Java identifier.
Is there an existing tool for this task?
With so many Java source refactoring/generating frameworks, one would think this should be quite a common task.
Upvotes: 12
Views: 7837
Reputation: 425198
This simple method will convert any input string into a valid Java identifier:
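import java.nio.charset.StandardCharsets;

public static String getIdentifier(String str) {
    // Start with '_' so the result never begins with a digit.
    StringBuilder sb = new StringBuilder("_");
    // StandardCharsets.UTF_8 (Java 7+) avoids the checked UnsupportedEncodingException.
    for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
        // Bytes are signed in Java; mask to 0..255 so that, for example,
        // -30 and 30 cannot both end up encoded as "30".
        sb.append(b & 0xFF).append('_');
    }
    return sb.toString();
}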
Example:
"%^&*\n()" --> "_37_94_38_42_10_40_41_"
Any input characters whatsoever will work - foreign language chars, linefeeds, anything!
In addition, this algorithm is deterministic: two inputs yield the same identifier whenever str1.equals(str2).
Thanks to Joachim Sauer for the UTF-8 suggestion.
If collisions are OK (that is, if two input strings may produce the same result), this code produces more readable output:
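public static String getIdentifier(String str) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        // The first character must be a valid identifier start;
        // later characters only need to be valid identifier parts.
        boolean valid = (i == 0) ? Character.isJavaIdentifierStart(c) : Character.isJavaIdentifierPart(c);
        if (valid) {
            sb.append(c);
        } else {
            // Without this prefix, an invalid first character would make the result start with a digit.
            if (i == 0) {
                sb.append('_');
            }
            sb.append((int) c);
        }
    }
    return sb.toString();
}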
It preserves characters that are valid in an identifier and converts only the invalid ones to their decimal code values.
Upvotes: 12
Reputation: 719229
With so many Java source refactoring/generating frameworks, one would think this should be quite a common task.
Actually it is not.
A code refactoring framework will start with existing valid Java identifiers and will be able to generate a new identifier by concatenating them with some additional characters for disambiguation purposes.
A typical code generation framework will start out with "names" taken from a restricted character set. It won't have to deal with arbitrary characters.
I presume that the aim of your converter is to produce identifiers that resemble the input strings where possible. If that's the case, I would do the conversion by mapping all legal identifier characters as-is and replacing illegal identifier characters with "$xxxx", where "xxxx" is a 4-digit hex encoding of the Java 16-bit character.
Your scheme works too, but replacing all illegal characters with '_' is more likely to result in identifier collisions; i.e. where two input strings map to the same identifier.
This is straightforward to code, so I'll leave it for you to do.
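For readers who want a starting point, here is a minimal sketch of the "$xxxx" escaping described above. The class name DollarHexEscaper and the method name toIdentifier are made up for illustration, and the sketch assumes that a character which is legal later in an identifier but not at the start (such as a digit) should also be escaped when it comes first.

```java
final class DollarHexEscaper {
    // Hypothetical helper: keeps legal identifier characters as-is and
    // replaces every other character with "$xxxx" (4-digit hex of the UTF-16 code unit).
    static String toIdentifier(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Assumption: the first character must satisfy the stricter "start" rule.
            boolean legal = (i == 0)
                    ? Character.isJavaIdentifierStart(c)
                    : Character.isJavaIdentifierPart(c);
            if (legal) {
                sb.append(c);
            } else {
                // '$' is itself a legal identifier start, so the escape is valid in any position.
                sb.append('$').append(String.format("%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("hello world")); // hello$0020world
        System.out.println(toIdentifier("9lives"));      // $0039lives
    }
}
```

Two caveats with this sketch: an empty input yields an empty string, which is not a valid identifier, and because '$' is passed through unchanged, an input that already contains text like "$0020" maps to the same identifier as an input with the escaped character. Escaping '$' as well would remove that collision, at some cost in readability.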
Upvotes: 0
Reputation: 38526
If you are doing this for autogenerated code (i.e. don't care much about readability) one of my favorites is just to Base64 it. No need to play language lawyer over what characters are valid in what encodings, and it's a pretty common way of "protecting" arbitrary byte data.
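One thing to watch with this approach: standard Base64 output can contain '+', '/' and '=', none of which is legal in a Java identifier, and it can start with a digit. The sketch below is one way to work around that, not the only one. It assumes the URL-safe alphabet without padding, swaps the remaining illegal '-' for '$', and prefixes '_'. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Identifier {
    // Hypothetical helper: Base64-encodes the UTF-8 bytes of the input and
    // adjusts the result so that it is a legal Java identifier.
    static String encode(String input) {
        String b64 = Base64.getUrlEncoder().withoutPadding()
                .encodeToString(input.getBytes(StandardCharsets.UTF_8));
        // The URL-safe alphabet is A-Z a-z 0-9 '-' '_'. Only '-' is illegal in an
        // identifier; '$' is legal and unused by the alphabet, so the swap is reversible.
        // The leading '_' guards against output that starts with a digit.
        return "_" + b64.replace('-', '$');
    }

    // Reverses encode(): drops the '_' prefix and restores '-'.
    static String decode(String identifier) {
        String b64 = identifier.substring(1).replace('$', '-');
        return new String(Base64.getUrlDecoder().decode(b64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = encode("a b");
        System.out.println(id);         // _YSBi
        System.out.println(decode(id)); // a b
    }
}
```

The result is unreadable, but it stays reversible, which suits generated code that nobody edits by hand.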
Upvotes: 2
Reputation: 68972
I don't know of a tool for that purpose, but one can easily be created using the Character class.
Did you know that string€with_special_characters___ is a legal Java identifier?
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class Conv {
    public static void main(String[] args) {
        String[] idents = { "string with spaces", "100stringsstartswithnumber",
                "string€with%special†characters/\\!", "" };
        for (String ident : idents) {
            System.out.println(convert(ident));
        }
    }

    private static String convert(String ident) {
        if (ident.length() == 0) {
            return "_";
        }
        CharacterIterator ci = new StringCharacterIterator(ident);
        StringBuilder sb = new StringBuilder();
        for (char c = ci.first(); c != CharacterIterator.DONE; c = ci.next()) {
            if (c == ' ') {
                c = '_';
            }
            if (sb.length() == 0) {
                if (Character.isJavaIdentifierStart(c)) {
                    sb.append(c);
                    continue;
                } else {
                    // Prefix an underscore when the first character cannot start an identifier.
                    sb.append('_');
                }
            }
            // Keep characters that are legal inside an identifier; replace the rest with '_'.
            if (Character.isJavaIdentifierPart(c)) {
                sb.append(c);
            } else {
                sb.append('_');
            }
        }
        return sb.toString();
    }
}
Prints
string_with_spaces
_100stringsstartswithnumber
string€with_special_characters___
_
Upvotes: 3