Ewout

Reputation: 2478

XSD regex not matching

I'm creating an XML file based on someone else's XSD specification, but I just can't figure out why it doesn't validate.

Here's the rule:

<xs:simpleType name="NonEmptyStringType">
    <xs:restriction base="xs:string">
        <xs:minLength value="1" />
        <xs:pattern value="[^\t\n\r]*[^\s][^\t\n\r]*" />
    </xs:restriction>
</xs:simpleType>

in which I read the pattern as follows:

and here is one of the many XML fragments that fail to match:

        <Zipcode>3506 RT</Zipcode>

It's not matching 3506 RT (or 3506RT for that matter, and many other things I would expect to match) according to xmllint, with the following error:

element Zipcode: Schemas validity error : Element '{http://www.reeleezee.nl/taxonomy/1.23}Zipcode': [facet 'pattern'] The value '3506 RT' is not accepted by the pattern '[^\t\n\r]*[^\s][^\t\n\r]*'.

Any hints on what I'm not interpreting right? (I don't understand the strictness of their NonEmptyStringType btw, I would just use .+)


As requested, here's the zipcode declaration:

<xs:element name="Zipcode" minOccurs="0" nillable="true" rse:CanIgnore="true">
    <xs:annotation>
        <xs:documentation>Postcode</xs:documentation>
    </xs:annotation>
    <xs:simpleType>
        <xs:restriction base="NonEmptyStringType">
            <xs:maxLength value="10" />
        </xs:restriction>
    </xs:simpleType>
</xs:element>

As you can see, this links back to the pattern in NonEmptyStringType (the first rule posted above).

Upvotes: 0

Views: 2479

Answers (3)

13ren

Reputation: 12187

This regex looks fine to me. I think it's a bug in your validation tool... they are often buggy in edge-cases.

OK, just checked: xerces accepts it; xmllint fails (I see you were using xmllint). I've found several times in the past that xerces is correct, and xmllint has problems in unusual cases. And this regex is unusual. (I have to say, I actually love xmllint, it's really fast, but the xsd spec is huge, complex and confusing, and the xmllint folks haven't nailed all the edge cases yet).

The two online validators I tried also accept it: http://www.utilities-online.info/xsdvalidation and http://www.freeformatter.com/xml-validator-xsd.html

BTW: for Xerces, I downloaded their Java version and found their class jaxp.SourceValidator the best tool for validating. I believe it's the same code that already ships with Java, though.


EDIT: I did some more tests in Xerces to make sure that the regex can fail (i.e. that it is active). It fails if there is a \n anywhere (same for \t, though I didn't test \r).

Checking the spec, \s is defined as [#x20\t\n\r] (in this table). That makes it clear that the regex forbids \t, \n and \r anywhere in the value. You can have as many literal space characters (#x20) as you like, provided the value isn't all spaces (i.e. there is at least one non-space character to match that [^\s], which could also be written as \S). Xerces confirms this: an all-spaces value gives an error.

Maybe they want to allow space literals (both padding and interspersing), provided there is some value in there (i.e. not all spaces).
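To make that reading concrete, here is a minimal sketch in Python. It is not an XSD validator: Python's \s is broader than the XSD definition, so I have assumed a hand translation in which \s is spelled out as the four characters from the spec, and I use fullmatch because XSD patterns are implicitly anchored at both ends. The names XSD_PATTERN and accepts are just for illustration.

```python
import re

# Assumed hand translation of the XSD pattern [^\t\n\r]*[^\s][^\t\n\r]*
# with \s written out as XSD defines it: space, tab, newline, carriage return.
XSD_PATTERN = re.compile(r"[^\t\n\r]*[^ \t\n\r][^\t\n\r]*")


def accepts(value: str) -> bool:
    """Return True if the whole value matches the translated pattern."""
    # XSD patterns must match the entire value, hence fullmatch.
    return XSD_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    samples = ["3506 RT", "3506RT", "  3506 RT  ", "   ", "", "3506\tRT", "3506\nRT"]
    for sample in samples:
        print(repr(sample), accepts(sample))
```

Under that translation, the values with only spaces and visible characters are accepted, while the empty string, the all-spaces string and anything containing a tab or newline are rejected, which is the same behaviour I saw in Xerces.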

Upvotes: 3

Rookie Programmer Aravind

Reputation: 12154

[^\s] matches anything that is not a space,

but your input string 3506 RT has a space!

I think that is why it is failing :) The first [^\t\n\r]* matches 3506, after which [^\s] does not expect a space character, but one appears! The final [^\t\n\r]* also passes, because the next set of characters is RT.

So what you should have declared is:

<xs:pattern value="[^\t\n\r\s]*[\s][^\t\n\r\s]*" />

Now this will allow

  1. Anything that is NOT \t, \n, \r or \s. To be stricter about the pattern, you can add +, which allows the string only if it has at least one non-whitespace character at the beginning.
  2. A space character: we can make it optional by declaring it as [\s]?, where ? allows one occurrence or none, so the space character can't repeat.
  3. Again, anything that is NOT \t, \n, \r or \s.

<xs:pattern value="[^\t\n\r\s]+[\s]?[^\t\n\r\s]*" />

Actually, it can be made stricter still by validating digits and letters explicitly rather than using the [^\t\n\r\s] declaration.
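As a rough sketch of what such a stricter check could look like, here is a Python version. It assumes the Zipcode element holds Dutch-style postcodes like the 3506 RT in the question (four digits, an optional single space, two capital letters); that format, the pattern and the names STRICT_ZIPCODE and is_strict_zipcode are my own assumptions, not something taken from the schema.

```python
import re

# Hypothetical stricter pattern: four digits, an optional single space,
# then two uppercase letters. This is an assumed format, not the schema's.
STRICT_ZIPCODE = re.compile(r"[0-9]{4} ?[A-Z]{2}")


def is_strict_zipcode(value: str) -> bool:
    """Return True if the whole value has the assumed postcode shape."""
    # fullmatch mirrors the implicit anchoring of an xs:pattern facet.
    return STRICT_ZIPCODE.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["3506 RT", "3506RT", "3506  RT", "3506 rt", "35O6 RT"]:
        print(repr(sample), is_strict_zipcode(sample))
```

The same expression should carry over to an xs:pattern value as written, since it only uses character ranges and quantifiers, but I have not run it through a schema validator.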

Hope it helps! Let me know if you have any questions.

Upvotes: 1

Kyle Maxwell

Reputation: 627

I don't believe \r is a space, it's a carriage return (similar to \n newline). You might want to replace that with \s or just the actual literal " ".

Upvotes: 0
