Reputation: 11107
[a-zA-Z_:]([a-zA-Z0-9_:.])*
Would this do?
Upvotes: 5
Views: 9946
Reputation: 647
Given the following basic criteria:
I use the following regex pattern for basic XML element (tag) name validation:
/^([_a-z][\w]?|[a-w_yz][\w]{2,}|[_a-z][a-l_n-z\d][\w]+|[_a-z][\w][a-k_m-z\d][\w]*)$/i
...which is fairly short compared with most single-pattern examples I've seen, and it works quickly and reliably within the criteria outlined above.
Breakdown:
I'm posting this in case it helps anyone looking for a slightly more straightforward (albeit Anglo-centric) way to validate a simplified set of XML element (tag) names.
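As a quick sanity check, the pattern above can be run against a few sample names. This is only a sketch: the sample names and the `isSimpleTagName` helper are made up for illustration, and the comments describe how the four alternatives appear to behave rather than the author's own breakdown.

```javascript
// The pattern from the answer above, unchanged.
const simpleTagName =
  /^([_a-z][\w]?|[a-w_yz][\w]{2,}|[_a-z][a-l_n-z\d][\w]+|[_a-z][\w][a-k_m-z\d][\w]*)$/i;

// Hypothetical helper name, used only for this example.
function isSimpleTagName(name) {
  return simpleTagName.test(name);
}

// Illustrative inputs: the four alternatives together accept names that start
// with a letter or underscore, unless the name begins with "xml" in any case.
const samples = ["item", "_id", "x", "xm", "xmas", "xml", "XMLdata", "1st", "a-b"];
for (const name of samples) {
  console.log(name, isSimpleTagName(name));
}
```

Note that `\w` covers neither `-` nor `.`, so a name such as `a-b` is rejected by this pattern even though hyphens are legal in XML names.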
Upvotes: 2
Reputation: 94
^(:|[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD])(:|[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|-|\.|[0-9]|\xB7|[\u0300-\u036F]|[\u203F-\u2040])*$
This correctly matches everything except [#x10000-#xEFFFF], as it is not possible (as far as I know) to match Unicode characters outside the 16-bit range with plain \uXXXX escapes.
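Whether code points above U+FFFF can be matched depends on the regex flavour. As one example, JavaScript engines that support the ES2015 `u` flag accept `\u{...}` code-point escapes, so the supplementary range from the spec can be written directly. The sketch below assumes such an engine; `isSupplementaryNameStart` is a hypothetical name used only here.

```javascript
// Assumption: an ES2015+ engine. The "u" flag makes the regex operate on
// code points and enables \u{...} escapes beyond U+FFFF.
const supplementaryNameStart = /^[\u{10000}-\u{EFFFF}]$/u;

// Hypothetical helper: true if the string is exactly one code point
// in the range [#x10000-#xEFFFF].
function isSupplementaryNameStart(ch) {
  return supplementaryNameStart.test(ch);
}

console.log(isSupplementaryNameStart("\u{10400}")); // true
console.log(isSupplementaryNameStart("a"));         // false
```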
To sanitize XML names, you can use this function:
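private static function getValidXMLName($value) {
    // Alternation of characters allowed as the first character of a name (":" is not included).
    $validStartNameChar =
        '[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\x{2FF}]|[\x{370}-\x{37D}]|[\x{37F}-\x{1FFF}]|'.
        '[\x{200C}-\x{200D}]|[\x{2070}-\x{218F}]|[\x{2C00}-\x{2FEF}]|[\x{3001}-\x{D7FF}]|[\x{F900}-\x{FDCF}]|[\x{FDF0}-\x{FFFD}]';
    // Characters allowed after the first one: the start set plus "-", ".", digits and a few extra ranges.
    $validNameChar = $validStartNameChar . '|\-|\.|[0-9]|\xB7|[\x{300}-\x{36F}]|[\x{203F}-\x{2040}]';
    // Remove every character that is not a valid name character.
    $valueClean = preg_replace('/(?!'.$validNameChar.')./u','',$value);
    $firstChar = mb_substr($valueClean, 0, 1);
    // If the first remaining character cannot start a name (or nothing is left), prepend "_".
    if (!(strlen(preg_replace('/(?!'.$validStartNameChar.')./u', '', $firstChar)) > 0)) {
        return '_' . $valueClean;
    }
    return $valueClean;
}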
This strips any invalid characters and, if the first remaining character is not a valid first character, prepends an underscore.
It's maybe not the prettiest or best way, but for what I am using it for (building an XML log) it will be fine.
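For readers not working in PHP, the same strip-then-prefix idea might be sketched in JavaScript roughly as follows. This is an illustrative port, not the author's code: `toValidXmlName` and the constant names are hypothetical, the character sets are copied from the function above (so ":" and the supplementary planes are left out as well), and an ES2015+ engine with the `u` flag is assumed.

```javascript
// Character-class bodies copied from the PHP function above.
const NAME_START =
  "A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\u02FF\\u0370-\\u037D\\u037F-\\u1FFF" +
  "\\u200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF" +
  "\\uF900-\\uFDCF\\uFDF0-\\uFFFD";
const NAME_REST = NAME_START + "\\-.0-9\\xB7\\u0300-\\u036F\\u203F-\\u2040";

// Matches any single code point that is NOT a valid name character.
const INVALID_CHAR = new RegExp("[^" + NAME_REST + "]", "gu");
// Matches a string whose first character may start a name.
const VALID_START = new RegExp("^[" + NAME_START + "]", "u");

// Hypothetical helper: strip invalid characters, then prepend "_" if the
// result does not begin with a valid start character (or is empty).
function toValidXmlName(value) {
  const cleaned = value.replace(INVALID_CHAR, "");
  return VALID_START.test(cleaned) ? cleaned : "_" + cleaned;
}

console.log(toValidXmlName("my tag!")); // "mytag"
console.log(toValidXmlName("1abc"));    // "_1abc"
```

As with the original, this silently changes the input, so two different source strings can end up with the same name.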
Upvotes: 3
Reputation: 31
Background Information:
According to w3schools.com the rules for tag names in XML are
Possible Solution:
Let's do it in a couple of steps, using JavaScript. Please feel free to translate as necessary. Why use one complex regex when you can break it down into more readable and maintainable code with multiple regex tests?
function isXMLTagName ( tag ) // returns true if meets cond. 1-5 above
{
    var t = !/^[xX][mM][lL].*/.test(tag); // condition 3
    t = t && /^[a-zA-Z_].*/.test(tag); // condition 2
    t = t && /^[a-zA-Z0-9_\-\.]+$/.test(tag); // condition 4
    return t;
}
I have this same problem in a project right now. Hope this works.
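One possible benefit of keeping the tests separate is that each one can report why a name was rejected. The sketch below is a hypothetical variation on the function above, not part of the original answer: `firstFailedRule` and the rule descriptions are made up here, and the three checks mirror the regex tests shown above.

```javascript
// Each entry pairs a human-readable rule with one of the regex tests above.
const checks = [
  { rule: "must not start with 'xml' (any case)",
    ok: (tag) => !/^[xX][mM][lL]/.test(tag) },
  { rule: "must start with a letter or underscore",
    ok: (tag) => /^[a-zA-Z_]/.test(tag) },
  { rule: "may contain only letters, digits, '_', '-' and '.'",
    ok: (tag) => /^[a-zA-Z0-9_.-]+$/.test(tag) },
];

// Hypothetical helper: returns the description of the first rule the tag
// breaks, or null if it passes all three checks.
function firstFailedRule(tag) {
  const failed = checks.find((check) => !check.ok(tag));
  return failed ? failed.rule : null;
}

console.log(firstFailedRule("item-1"));  // null
console.log(firstFailedRule("xmlData")); // "must not start with 'xml' (any case)"
console.log(firstFailedRule("my tag"));  // "may contain only letters, digits, '_', '-' and '.'"
```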
Upvotes: 3
Reputation: 19002
EDIT:
.NET also has the method XmlConvert.VerifyName(string).
From Wikipedia:
Unicode characters in the following code point ranges are valid in XML 1.0 documents:
Unicode characters in the following code point ranges are always valid in XML 1.1 documents:
The preceding code points are contained in the following code point ranges which are only valid in certain contexts in XML 1.1 documents:
Upvotes: 5
Reputation: 1074666
Do you mean XML element names? If so, no, that's too exclusive: there are lots of valid characters that it doesn't cover. More in the spec here and here:
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                  [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                  [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
                  [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
                  [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
                  [#x0300-#x036F] | [#x203F-#x2040]
Name          ::= NameStartChar (NameChar)*
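The productions above translate fairly directly into a regular expression. Here is a minimal JavaScript sketch, assuming an ES2015+ engine with the `u` flag; `isXmlName` and the constant names are hypothetical. It checks only the `Name` production, not other constraints such as reserved "xml" prefixes or namespace rules for ":".

```javascript
// NameStartChar, written as the body of a character class.
const NAME_START_CHAR =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}" +
  "\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}" +
  "\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | ...
const NAME_CHAR =
  NAME_START_CHAR + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

// Hypothetical helper: true if the whole string matches the Name production.
function isXmlName(s) {
  return XML_NAME.test(s);
}

console.log(isXmlName("_a-b.c")); // true
console.log(isXmlName("1abc"));   // false (a digit cannot start a name)
console.log(isXmlName("a b"));    // false (space is not a NameChar)
```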
Upvotes: 14