F.P

Reputation: 17831

PHP regex for valid XML tag name

What is a good general regex (in PHP terms) to determine if a string is a valid XML tag name?

I started with /[^>]+/i, but that also matches something like 4 \<<, which obviously isn't a valid tag name.

So I tried listing the valid characters, as in /[a-z][a-z0-9_-]*/i, but that isn't quite right either, since XML allows virtually any character in tag names, including characters from non-English scripts.

I'm stuck on that now - should I just check if there are whitespace characters? Or is there more to it?

Upvotes: 4

Views: 1616

Answers (3)

hoppa

Reputation: 3041

From the same specification, stated a bit more clearly:

"Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references."

As far as I can interpret that, almost everything goes. As Gordon states below, using a parser which knows the rules is best!

Upvotes: 1

Gordon

Reputation: 317029

Why don't you just use an XML parser/generator that already knows the rules?

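// Returns true if the DOM extension accepts $elementName as an element name.
function isValidXmlElementName($elementName)
{
    try {
        // The DOMElement constructor throws a DOMException for an invalid name.
        new DOMElement($elementName);
    } catch (DOMException $e) {
        return false;
    }
    return true;
}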

var_dump(isValidXmlElementName(' ')); // false 
var_dump(isValidXmlElementName('1')); // false
var_dump(isValidXmlElementName('-')); // false
var_dump(isValidXmlElementName('a')); // true

Upvotes: 10

Mark Byers

Reputation: 838376

From the XML specification:

[4]     NameStartChar      ::=      ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]    NameChar       ::=      NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]     Name       ::=      NameStartChar (NameChar)*
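
If a regex is still wanted, the productions above can be transcribed more or less directly into a PCRE character class. The following is only a sketch: the constant names and the function name matchesXmlName are made up for illustration, and it assumes the input is UTF-8 (the /u modifier requires that). Note that the Name production allows colons, while namespace-aware parsers treat a colon as a prefix separator, so results for names containing ":" may differ from what a parser accepts.

```php
<?php
// Ranges transcribed from production [4] (NameStartChar).
// Single-quoted strings keep \x{...} intact for PCRE to interpret.
const NAME_START_CHAR = ':A-Z_a-z\x{C0}-\x{D6}\x{D8}-\x{F6}\x{F8}-\x{2FF}'
    . '\x{370}-\x{37D}\x{37F}-\x{1FFF}\x{200C}-\x{200D}\x{2070}-\x{218F}'
    . '\x{2C00}-\x{2FEF}\x{3001}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFFD}'
    . '\x{10000}-\x{EFFFF}';

// Production [4a] (NameChar): NameStartChar plus "-", ".", digits and a few extras.
const NAME_CHAR = NAME_START_CHAR . '\-.0-9\x{B7}\x{300}-\x{36F}\x{203F}-\x{2040}';

// Hypothetical helper: checks a string against production [5] (Name).
function matchesXmlName(string $name): bool
{
    // u: treat pattern and subject as UTF-8.
    // D: make $ match only at the very end, not before a trailing newline.
    $pattern = '/^[' . NAME_START_CHAR . '][' . NAME_CHAR . ']*$/Du';

    // preg_match returns false for invalid UTF-8 input, which also yields false here.
    return preg_match($pattern, $name) === 1;
}

var_dump(matchesXmlName('a'));       // true
var_dump(matchesXmlName('1'));       // false
var_dump(matchesXmlName('-'));       // false
var_dump(matchesXmlName(' '));       // false
var_dump(matchesXmlName('élément')); // true
```

Both anchors matter here: without ^ and $, any string that merely contains one valid character would pass, which is the same problem the unanchored patterns in the question have.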

Upvotes: 4
