Reputation: 3458
Do you know of a function in Java that will validate that a string is a valid XML element name?
From w3schools:
XML elements must follow these naming rules:
- Names can contain letters, numbers, and other characters
- Names cannot start with a number or punctuation character
- Names cannot start with the letters xml (or XML, or Xml, etc)
- Names cannot contain spaces
I found other questions that offered regex solutions. Isn't there a function that already does that?
Upvotes: 12
Views: 13752
Reputation: 9584
As a current addition to the accepted answer:
At least Oracle's JDK 1.8 (and probably older versions as well) uses the Xerces parser internally, in the non-public com.sun.* packages. You should never use the classes in those packages directly, as they may change without notice in future versions of the JDK. However, the code needed to check the validity of an XML element name is well encapsulated and can be copied into your own code. This way, you avoid a dependency on an external library.
This is the required code, taken from the internal class com.sun.org.apache.xerces.internal.util.XMLChar:
public class XMLChar {

    /** Character flags. */
    private static final byte[] CHARS = new byte[1 << 16];

    /** Name start character mask. */
    public static final int MASK_NAME_START = 0x04;

    /** Name character mask. */
    public static final int MASK_NAME = 0x08;

    static {
        // Initializing the Character Flag Array
        // Code generated by: XMLCharGenerator.
        CHARS[9] = 35;
        CHARS[10] = 19;
        CHARS[13] = 19;
        // ...
        // the entire static block must be copied
    }

    /**
     * Check to see if a string is a valid Name according to [5]
     * in the XML 1.0 Recommendation
     *
     * @param name string to check
     * @return true if name is a valid Name
     */
    public static boolean isValidName(String name) {
        final int length = name.length();
        if (length == 0) {
            return false;
        }
        char ch = name.charAt(0);
        if (!isNameStart(ch)) {
            return false;
        }
        for (int i = 1; i < length; ++i) {
            ch = name.charAt(i);
            if (!isName(ch)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if the specified character is a valid name start
     * character as defined by production [5] in the XML 1.0
     * specification.
     *
     * @param c The character to check.
     */
    public static boolean isNameStart(int c) {
        return c < 0x10000 && (CHARS[c] & MASK_NAME_START) != 0;
    }

    /**
     * Returns true if the specified character is a valid name
     * character as defined by production [4] in the XML 1.0
     * specification.
     *
     * @param c The character to check.
     */
    public static boolean isName(int c) {
        return c < 0x10000 && (CHARS[c] & MASK_NAME) != 0;
    }
}
Upvotes: 2
Reputation: 120486
The relevant production from the spec is http://www.w3.org/TR/xml/#NT-Name
Name ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
So a Java regex to match it is (the \x{...} escapes for code points above #xFFFF need Java 7 or later)
"^[:A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02ff\\u0370-\\u037d"
+ "\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f\\u2c00-\\u2fef\\u3001-\\ud7ff"
+ "\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}]"
+ "[:A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6"
+ "\\u00F8-\\u02ff\\u0370-\\u037d\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f"
+ "\\u2c00-\\u2fef\\u3001-\\ud7ff\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}\\-\\.0-9"
+ "\\u00b7\\u0300-\\u036f\\u203f-\\u2040]*\\z"
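The production above can also be checked one code point at a time, without a regex. The following is only a sketch: the class name XmlNames and its method names are made up for illustration, and the range tables are transcribed from the NameStartChar and NameChar productions quoted above. Iterating by code point rather than by char lets it handle the [#x10000-#xEFFFF] range.

```java
final class XmlNames {
    private XmlNames() {}

    // Inclusive code point ranges of NameStartChar, transcribed from the production.
    private static final int[][] NAME_START = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
        {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Ranges that NameChar allows in addition to NameStartChar.
    private static final int[][] NAME_EXTRA = {
        {'-', '-'}, {'.', '.'}, {'0', '9'}, {0xB7, 0xB7}, {0x300, 0x36F}, {0x203F, 0x2040}
    };

    private static boolean inRanges(int codePoint, int[][] ranges) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    static boolean isNameStartChar(int codePoint) {
        return inRanges(codePoint, NAME_START);
    }

    static boolean isNameChar(int codePoint) {
        return isNameStartChar(codePoint) || inRanges(codePoint, NAME_EXTRA);
    }

    /** Returns true if name matches Name ::= NameStartChar (NameChar)*. */
    static boolean isValidName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!isNameStartChar(first)) {
            return false;
        }
        // Step by code point so that surrogate pairs count as one character.
        for (int i = Character.charCount(first); i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            if (!isNameChar(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

A call such as XmlNames.isValidName("my-element") would then replace the regex match. An unpaired surrogate is rejected because its value falls outside every listed range.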
If you want to deal with namespaced names, you need to make sure that there is at most one colon and that neither the prefix nor the local part is empty, so
"^[A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02ff\\u0370-\\u037d"
+ "\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f\\u2c00-\\u2fef\\u3001-\\ud7ff"
+ "\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}]"
+ "[A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02ff\\u0370-\\u037d"
+ "\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f\\u2c00-\\u2fef\\u3001-\\ud7ff"
+ "\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}\\-\\.0-9\\u00b7\\u0300-\\u036f\\u203f-\\u2040]*"
+ "(?::[A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02ff\\u0370-\\u037d"
+ "\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f\\u2c00-\\u2fef\\u3001-\\ud7ff"
+ "\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}]"
+ "[A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02ff\\u0370-\\u037d"
+ "\\u037f-\\u1fff\\u200c\\u200d\\u2070-\\u218f\\u2c00-\\u2fef\\u3001-\\ud7ff"
+ "\\uf900-\\ufdcf\\ufdf0-\\ufffd\\x{10000}-\\x{EFFFF}\\-\\.0-9\\u00b7\\u0300-\\u036f\\u203f-\\u2040]*)?\\z"
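Both patterns repeat the same character ranges several times, which makes typos easy to introduce. One possible way to reduce that, sketched here under the assumption of Java 7 or later (for the code point escapes written with x and braces), is to build both patterns from shared fragments. The class name XmlNamePatterns and the methods isName and isQName are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

final class XmlNamePatterns {
    private XmlNamePatterns() {}

    // NameStartChar ranges without the colon, written once.
    private static final String START =
        "A-Z_a-z\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u02FF\\u0370-\\u037D"
        + "\\u037F-\\u1FFF\\u200C\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF"
        + "\\uF900-\\uFDCF\\uFDF0-\\uFFFD\\x{10000}-\\x{EFFFF}";

    // Characters that NameChar allows in addition to NameStartChar.
    private static final String EXTRA = "\\-.0-9\\u00B7\\u0300-\\u036F\\u203F-\\u2040";

    // A name with no colon at all (prefix or local part of a namespaced name).
    private static final String NCNAME = "[" + START + "][" + START + EXTRA + "]*";

    // Plain XML Name: colons are allowed anywhere.
    static final Pattern NAME =
        Pattern.compile("[:" + START + "][:" + START + EXTRA + "]*");

    // Namespaced name: an optional prefix, then at most one colon.
    static final Pattern QNAME =
        Pattern.compile(NCNAME + "(?::" + NCNAME + ")?");

    static boolean isName(String s) {
        // matches() requires the whole input to match, so no anchors are needed.
        return s != null && NAME.matcher(s).matches();
    }

    static boolean isQName(String s) {
        return s != null && QNAME.matcher(s).matches();
    }
}
```

Compiling the patterns once as constants also avoids recompiling the regex on every call.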
Upvotes: 4
Reputation: 169
Using the org.apache.xerces utilities is a good way to go; however, if you need to stick to Java code that's part of the standard Java API then the following code will do it:
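import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Throws a SAXException if the given text is not well-formed XML.
public void parse(String xml) throws Exception {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.setContentHandler(new DefaultHandler());
    parser.parse(new InputSource(new StringReader(xml)));
}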
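To turn a parser call like the one above into a check on a single element name, one option is to wrap the candidate name in a minimal document and confirm that the parser reports exactly that name for the root element. The sketch below makes several assumptions: the class name SaxNameCheck and the method isValidElementName are hypothetical, the result depends on which names the underlying parser accepts, and the disallow-doctype-decl feature is a Xerces feature that the JDK's built-in parser recognizes but other parsers may reject.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxNameCheck {
    private SaxNameCheck() {}

    static boolean isValidElementName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Assumption: the parser supports this Xerces feature. It stops a
            // crafted "name" from smuggling in a DOCTYPE declaration.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("could not configure the XML parser", e);
        }

        // Records the name of the first element the parser reports.
        final String[] rootName = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (rootName[0] == null) {
                    rootName[0] = qName;
                }
            }
        };

        try {
            parser.parse(new InputSource(new StringReader("<" + name + "/>")), handler);
        } catch (SAXException | IOException e) {
            return false; // the wrapped document is not well-formed
        }
        // Input such as "a b='c'" parses as element "a" with an attribute,
        // so the reported root name must equal the input exactly.
        return name.equals(rootName[0]);
    }
}
```

The final comparison matters: without it, any string that happens to form a well-formed tag would be accepted as a name.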
Upvotes: 2
Reputation: 24299
If you are using the Xerces XML parser, you can use the isValidName() method of the XMLChar (or XML11Char) class, like this:
org.apache.xerces.util.XMLChar.isValidName(String name)
There is also sample code available here for isValidName.
Upvotes: 14