simpatico

Reputation: 11107

What would be a regex for valid XML names?

[a-zA-Z_:]([a-zA-Z0-9_:.])*

Would this do?

Upvotes: 5

Views: 9946

Answers (6)

NetXpert

Reputation: 647

Given the following basic criteria:

  • permitted characters are the standard 26 Latin letters, 10 Arabic numerals, and the underscore,
  • the leading character can be only a valid letter or an underscore,
  • the name cannot start with "xml" in any case variation

I use the following regex pattern for basic XML element (tag) name validation:

/^([_a-z][\w]?|[a-w_yz][\w]{2,}|[_a-z][a-l_n-z\d][\w]+|[_a-z][\w][a-k_m-z\d][\w]*)$/i

It is short compared with most single-pattern examples I've seen, and it runs quickly and works well within the constraints listed above.

Breakdown:

  • the first block validates any string of 1 or 2 characters in length.
  • the second block validates any 3+ character string that doesn't start with an "x" (or "X").
  • the third block validates any 3+ character string that doesn't have an "m" (or "M") in the 2nd position.
  • the fourth block validates any 3+ character string that doesn't have an "l" (or "L") in the 3rd position.
  • /i sets the Case-Insensitive flag, to significantly reduce the number of character literals needed within the blocks.
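To see how the four blocks behave, here is a minimal sketch that runs the pattern above against a few sample names. The variable names and the sample list are made up for illustration; only the pattern itself comes from the answer.

```javascript
// The pattern from the answer above, unchanged.
const simpleXmlName =
  /^([_a-z][\w]?|[a-w_yz][\w]{2,}|[_a-z][a-l_n-z\d][\w]+|[_a-z][\w][a-k_m-z\d][\w]*)$/i;

// Hypothetical sample names, picked to exercise each block.
const samples = ["a", "x1", "xm", "item", "_id", "xmas", "xml", "XmlNode", "1abc"];

for (const name of samples) {
  console.log(name, simpleXmlName.test(name));
}
```

Of these samples, only "xml", "XmlNode" and "1abc" should be rejected; "xmas" passes through the fourth block because its third character is not an "l".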

I posted this in case it helps anyone looking for a slightly more straightforward (albeit Anglo-centric) solution for validating a simplified set of XML element (tag) names.

Upvotes: 2

Joseph Bailey

Reputation: 94

^(:|[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD])(:|[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\u02FF]|[\u0370-\u037D]|[\u037F-\u1FFF]|[\u200C-\u200D]|[\u2070-\u218F]|[\u2C00-\u2FEF]|[\u3001-\uD7FF]|[\uF900-\uFDCF]|[\uFDF0-\uFFFD]|-|\.|[0-9]|\xB7|[\u0300-\u036F]|[\u203F-\u2040])*$

This matches every range in the spec except [#x10000-#xEFFFF], because (as far as I know) this regex syntax cannot match Unicode characters outside the 16-bit range.
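For what it's worth, JavaScript engines that support the ES2015 `u` flag can address code points above U+FFFF with `\u{...}` escapes, so the [#x10000-#xEFFFF] range can be written there. A minimal sketch, assuming such an engine (the names `astralStart` and `gothicAhsa` are made up for illustration):

```javascript
// With the `u` flag the regex works on code points rather than UTF-16
// code units, so the spec's [#x10000-#xEFFFF] range can be written directly.
const astralStart = /^[\u{10000}-\u{EFFFF}]$/u;

const gothicAhsa = "\u{10330}"; // U+10330, stored as a surrogate pair

console.log(gothicAhsa.length);            // 2 (UTF-16 code units)
console.log(astralStart.test(gothicAhsa)); // true
console.log(astralStart.test("A"));        // false
```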

To clean up invalid XML names, you can use this PHP function:

private static function getValidXMLName($value){
    // Characters allowed at the start of a name (no ':' and nothing above U+FFFF).
    $validStartNameChar =
        '[A-Z]|_|[a-z]|[\xC0-\xD6]|[\xD8-\xF6]|[\xF8-\x{2FF}]|[\x{370}-\x{37D}]|[\x{37F}-\x{1FFF}]|'.
        '[\x{200C}-\x{200D}]|[\x{2070}-\x{218F}]|[\x{2C00}-\x{2FEF}]|[\x{3001}-\x{D7FF}]|[\x{F900}-\x{FDCF}]|[\x{FDF0}-\x{FFFD}]';
    // Characters allowed after the first one: the start set plus '-', '.', digits, etc.
    $validNameChar = $validStartNameChar . '|\-|\.|[0-9]|\xB7|[\x{300}-\x{36F}]|[\x{203F}-\x{2040}]';

    // Strip every character that is not a valid name character.
    $valueClean = preg_replace('/(?!'.$validNameChar.')./u','',$value);
    $firstChar = mb_substr($valueClean, 0, 1);
    // If the first character is missing or not a valid start character, prepend '_'.
    if (!(strlen(preg_replace('/(?!'.$validStartNameChar.')./u', '', $firstChar)) > 0)) {
        return '_' . $valueClean;
    }

    return $valueClean;
}

This removes any invalid characters; if the first remaining character is not a valid start character, it prepends an underscore.

It's maybe not the prettiest or best way, but for what I am using it for (building an XML log) it will be fine.

Upvotes: 3

zavr

Reputation: 2129

For Node 10 and recent versions of Chrome:

/^[\p{L}_][\p{L}.\d_-]*$/u

Upvotes: 0

Nathan P. Cole

Reputation: 31

Background Information:

According to w3schools.com, the rules for tag names in XML are:

  1. Element names are case-sensitive
  2. Element names must start with a letter or underscore
  3. Element names cannot start with the letters xml (or XML, or Xml, etc)
  4. Element names can contain letters, digits, hyphens, underscores, and periods
  5. Element names cannot contain spaces

Possible Solution:

Let's do it in a couple of steps, using JavaScript. Please feel free to translate as necessary. Why use one complex regex when you can break it down into more readable and maintainable code with multiple regex tests?

function isXMLTagName ( tag ) // returns true if meets cond. 1-5 above
{
    var t = !/^[xX][mM][lL].*/.test(tag); // condition 3 
    t = t && /^[a-zA-Z_].*/.test(tag);  // condition 2
    t = t && /^[a-zA-Z0-9_\-\.]+$/.test(tag); // condition 4
    return t; 
}

I have this same problem in a project right now. Hope this works.

Upvotes: 3

JohnB

Reputation: 19002

EDIT:

.NET also has the method XmlConvert.VerifyName(string).

From Wikipedia:

Unicode characters in the following code point ranges are valid in XML 1.0 documents:

  • U+0009
  • U+000A
  • U+000D
  • U+0020–U+D7FF
  • U+E000–U+FFFD
  • U+10000–U+10FFFF

Unicode characters in the following code point ranges are always valid in XML 1.1 documents:

  • U+0001–U+0008
  • U+000B–U+000C
  • U+000E–U+001F
  • U+007F–U+0084
  • U+0086–U+009F

The preceding code points are contained in the following code point ranges which are only valid in certain contexts in XML 1.1 documents:

  • U+0001–U+D7FF
  • U+E000–U+FFFD
  • U+10000–U+10FFFF
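Note that these ranges describe which characters may appear anywhere in a document, not which may appear in a name. As a rough sketch of how the XML 1.0 list could be checked in JavaScript (the constant `XML10_RANGES` and the helper `isXml10Text` are hypothetical names, not part of any library):

```javascript
// Code point ranges for XML 1.0, copied from the first list above.
const XML10_RANGES = [
  [0x9, 0x9], [0xA, 0xA], [0xD, 0xD],
  [0x20, 0xD7FF], [0xE000, 0xFFFD], [0x10000, 0x10FFFF],
];

// Hypothetical helper: true if every code point in `text` is in some range.
function isXml10Text(text) {
  for (const ch of text) { // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (!XML10_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)) {
      return false;
    }
  }
  return true;
}

console.log(isXml10Text("tab\there"));  // true
console.log(isXml10Text("bell\u0007")); // false
```

A lone surrogate comes out of the loop as a single code point in U+D800–U+DFFF, which falls outside every range, so it is rejected as well.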

Upvotes: 5

T.J. Crowder

Reputation: 1074666

Do you mean XML element names? If so, no: that's too restrictive. There are lots of valid characters that your pattern doesn't cover. More in the spec here and here:

NameStartChar    ::=    ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                        [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
                        [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
                        [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] 

NameChar         ::=    NameStartChar | "-" | "." | [0-9] | #xB7 |
                        [#x0300-#x036F] | [#x203F-#x2040] 

Name             ::=    NameStartChar (NameChar)* 
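The grammar above translates fairly directly into a regex on engines that support the `u` flag. This is only a sketch under that assumption; the names `nameStart`, `nameChar` and `xmlName` are mine, not from the spec, and it does nothing about the reserved "xml" prefix.

```javascript
// NameStartChar ranges from the grammar, written as a character-class body.
const nameStart =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}" +
  "\\u{370}-\\u{37D}\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}" +
  "\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}\\u{3001}-\\u{D7FF}" +
  "\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar adds "-", ".", digits, #xB7 and the two extra ranges.
const nameChar =
  nameStart + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*
const xmlName = new RegExp(`^[${nameStart}][${nameChar}]*$`, "u");

console.log(xmlName.test("a:b-c.1")); // true
console.log(xmlName.test("élan"));    // true
console.log(xmlName.test("1abc"));    // false
console.log(xmlName.test("a b"));     // false
```

Building the pattern from two strings keeps the NameStartChar ranges in one place, since NameChar is defined as a superset of them.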

Upvotes: 14
