URL is considered invalid by Apache UrlValidator

Question

The issue is that our front end considers this URL valid while the back end does not. The URL is http://pyxlmedia.com/pc/talk=now&o=http://mobile.tmall.com/mobile .

Note that pc is followed by '/' rather than '?'.

If I change the '/' to '?', then both validators accept it, i.e.,

http://pyxlmedia.com/pc?talk=now&o=http://mobile.tmall.com/mobile is considered valid by both org.apache.commons.validator.routines.UrlValidator (1.5.1) and the site http://formvalidation.io/validators/uri/ .

The test code is

@Test
public void test() {
    UrlValidator urlValidator = new UrlValidator(new String[] {"http", "https"});
    assertTrue(urlValidator.isValid("http://pyxlmedia.com/pc/talk=now&o=http://mobile.tmall.com/mobile"));
}

First, I want to know which one is wrong: the front end or the back end? Second, how can I make their behavior consistent?

John Bollinger · Accepted Answer

I went back and forth several times as I analyzed this, but I have satisfied myself that your front-end is technically correct to accept the URL. Nevertheless, the tricksome URL may not mean what you think it means, so your back-end may be doing you a favor by flagging it.

The relevant standard here is provided by RFC 3986. (Slight modifications to the syntax are specified by RFC 7230 for the "http" URI scheme, but these do not change the analysis of the given URL.) According to the general URI syntax, the input URL breaks into components like this:

scheme: http

(delimiter) ://

authority: pyxlmedia.com

path: /pc/talk=now&o=http://mobile.tmall.com/mobile

Note in particular that the URL contains no query component, unlike the variation you presented that both validators accept.
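As a rough cross-check of this breakdown, here is a minimal sketch using the JDK's java.net.URI. The class and constant names are made up for illustration. Note also that java.net.URI follows the older RFC 2396 grammar rather than RFC 3986, so it is only an approximation of the analysis above, although for this URL the two should agree.

```java
import java.net.URI;

// Illustrative helper (not from the original answer): shows the components
// that java.net.URI extracts from the URL in the question.
class UriComponentsDemo {
    static final String QUESTION_URL =
            "http://pyxlmedia.com/pc/talk=now&o=http://mobile.tmall.com/mobile";

    // URI.create throws an unchecked IllegalArgumentException on a syntax error.
    static URI parse(String url) {
        return URI.create(url);
    }

    public static void main(String[] args) {
        URI uri = parse(QUESTION_URL);
        System.out.println("scheme:    " + uri.getScheme());
        System.out.println("authority: " + uri.getAuthority());
        System.out.println("path:      " + uri.getPath());
        // getQuery() returns null when the URI has no '?' component.
        System.out.println("query:     " + uri.getQuery());
    }
}
```

Replacing the '/' after pc with '?' in QUESTION_URL should move everything after it from the path into the query.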

The path component contains five segments, and your back-end validator is presumably tripping over one of these unusual characteristics of that component:

one segment is empty
the second segment contains unescaped characters '=' and '&', which the URI syntax classifies as "sub-delims"
the second segment contains an unescaped ':' character, which the URI syntax classifies as a "gen-delim"

However, analysis of the syntax for the path component (section 3.3 of RFC 3986) shows that segments other than the first in an absolute path are permitted to be empty, and that the ':' character and all the sub-delims are allowed to appear unescaped in path segments. (And RFC 7230 allows the first segment of an absolute path to be empty, too.)
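To make the grammar argument concrete, here is a small sketch that transcribes the RFC 3986 path-abempty rule (the path rule that applies when an authority is present) into a regular expression. The class and method names are hypothetical, and this checks only the path component, not a whole URL.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: checks a path against the
// RFC 3986 rule  path-abempty = *( "/" segment ),  segment = *pchar.
class Rfc3986PathCheck {
    // pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
    private static final String PCHAR =
            "(?:[A-Za-z0-9\\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})";

    // Under this rule any segment, including the first, may be empty.
    private static final Pattern PATH_ABEMPTY =
            Pattern.compile("(?:/" + PCHAR + "*)*");

    static boolean isValidPath(String path) {
        return PATH_ABEMPTY.matcher(path).matches();
    }

    public static void main(String[] args) {
        // The path from the question: '=', '&', ':' and the empty segment are all allowed.
        System.out.println(isValidPath("/pc/talk=now&o=http://mobile.tmall.com/mobile"));
        // A raw space is not a pchar, so this one is rejected.
        System.out.println(isValidPath("/pc/talk now"));
    }
}
```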

From the "I don't think it means what you think it means" department, however, I want to emphasize that the path breaks up into these segments:

pc

talk=now&o=http:

(empty)

mobile.tmall.com

mobile

Note in particular how the apparent URL within the path splits across four path segments.
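The segment list above can be reproduced with a few lines of plain Java. This is only a sketch with a made-up class name; it splits on '/' and does no percent-decoding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative helper: splits a URI path into its '/'-separated segments.
class PathSegmentsDemo {
    static List<String> segments(String path) {
        if (path.isEmpty()) {
            return Collections.emptyList();
        }
        // Drop the leading '/', then split; the -1 limit keeps trailing empty segments.
        return Arrays.asList(path.substring(1).split("/", -1));
    }

    public static void main(String[] args) {
        String path = "/pc/talk=now&o=http://mobile.tmall.com/mobile";
        for (String segment : segments(path)) {
            System.out.println(segment.isEmpty() ? "(empty)" : segment);
        }
    }
}
```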

As for how to make the behavior consistent, it depends on which behavior you actually want.

Apache UrlValidator does not have many configuration options, but one that it does have is ALLOW_2_SLASHES, which allows doubled slashes in the path component of URLs. It is passed as the options argument of the constructor, e.g. new UrlValidator(new String[] {"http", "https"}, UrlValidator.ALLOW_2_SLASHES). I am uncertain whether turning that option on will be sufficient to make it accept the given URL, but leaving it disabled surely contributes to rejecting the URL. If that is not sufficient and you want to accept the URL, then it looks like you'll need to choose or write a different validator.

For its part, the validator at http://formvalidation.io/validators/uri/ appears to have an equally small but different set of configuration options, and I don't see one among them that I would expect to change its evaluation of the URL in question. If you want to reject the troublesome URL at the front end, therefore, you'll need to find or write a different validator.
