Michael

Reputation: 42050

Illegal characters in URI

The java.net.URI constructor accepts most non-ASCII characters but rejects the ideographic space (U+3000). It fails with java.net.URISyntaxException: Illegal character in path ...

So my questions are:

Upvotes: 0

Views: 6622

Answers (2)

Stephen C

Reputation: 718798

Please note the 1st example contains the ideographic space rather than a regular space.

It is the ideographic space that is the problem.

Here is the code that allows non-ASCII characters to be used:

        } else if ((c > 128)
                   && !Character.isSpaceChar(c)
                   && !Character.isISOControl(c)) {
            // Allow unescaped but visible non-US-ASCII chars
            return p + 1;
        }

As you can see, it disallows "funky" non-visible characters: anything for which Character.isSpaceChar or Character.isISOControl returns true.
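To see the effect of that check from the outside, here is a minimal sketch. The class name UriSpaceCheck, the helper parses and the example.com URLs are made up for illustration; only the java.net.URI constructor and URISyntaxException come from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriSpaceCheck {
    // Hypothetical helper: true if the single-argument URI constructor accepts the string.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Visible non-ASCII characters (here two CJK ideographs) are accepted.
        System.out.println(parses("http://example.com/\u65E5\u672C")); // true

        // U+3000 IDEOGRAPHIC SPACE is a space character, so the path is rejected.
        System.out.println(parses("http://example.com/a\u3000b")); // false

        // U+0085 is an ISO control character, so it is rejected as well.
        System.out.println(parses("http://example.com/a" + (char) 0x85 + "b")); // false
    }
}
```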

See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.

Why?

It is probably a safety measure.

What others are disallowed?

Any character that is a space character or a control character, according to the respective Character predicate methods (Character.isSpaceChar and Character.isISOControl). See the Character javadocs for a precise specification.

You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed if you:

  • convert the UCS character code to UTF-8, and
  • percent encode the UTF-8 bytes as required by the spec.

My understanding is that the URI.toASCIIString() method will take care of that if you have a "deviant" java.net.URI object.
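A small sketch of what that looks like, assuming a URI that java.net.URI already accepts (the class name ToAsciiDemo, the helper ascii and the example.com URL are made up for illustration). Note that this only helps once the URI object exists; a string containing the ideographic space never gets that far, because the constructor rejects it first.

```java
import java.net.URI;

class ToAsciiDemo {
    // Hypothetical helper: parse the string, then return the US-ASCII form.
    // URI.create throws IllegalArgumentException if the string does not parse.
    static String ascii(String s) {
        return URI.create(s).toASCIIString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded as UTF-8 bytes C3 A9, then percent-encoded.
        System.out.println(ascii("http://example.com/caf\u00E9"));
        // prints http://example.com/caf%C3%A9
    }
}
```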

Upvotes: 0

Ian Roberts

Reputation: 122364

The set of acceptable characters is spelled out in detail in the Javadoc for java.net.URI:

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:

  • alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
  • digit The US-ASCII decimal digit characters, '0' through '9'
  • alphanum All alpha and digit characters
  • unreserved All alphanum characters together with those in the string "_-!.~'()*"
  • punct The characters in the string ",;:$&+="
  • reserved All punct characters together with those in the string "?/[]@"
  • escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
  • other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.

In particular, "other" does not include space characters, which are defined (by Character.isSpaceChar) as those with Unicode general category types

  • SPACE_SEPARATOR
  • LINE_SEPARATOR
  • PARAGRAPH_SEPARATOR

and according to the page you've linked to in the question, the ideographic space character is indeed one of these types.
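If you want to check a particular character yourself, something along these lines should do. This is only a sketch; the class name SpaceCategoryDemo and the helper describe are made up, while Character.getType, Character.isSpaceChar and the three category constants are from java.lang.Character.

```java
class SpaceCategoryDemo {
    // Hypothetical helper: name the general category if it is one of the three
    // that Character.isSpaceChar treats as a space character.
    static String describe(char c) {
        int type = Character.getType(c);
        if (type == Character.SPACE_SEPARATOR) {
            return "SPACE_SEPARATOR";
        } else if (type == Character.LINE_SEPARATOR) {
            return "LINE_SEPARATOR";
        } else if (type == Character.PARAGRAPH_SEPARATOR) {
            return "PARAGRAPH_SEPARATOR";
        }
        return "not a space character";
    }

    public static void main(String[] args) {
        char ideographicSpace = (char) 0x3000;
        System.out.println(describe(ideographicSpace));               // SPACE_SEPARATOR
        System.out.println(Character.isSpaceChar(ideographicSpace));  // true
        System.out.println(describe((char) 0x2028));                  // LINE_SEPARATOR
        System.out.println(describe((char) 0x2029));                  // PARAGRAPH_SEPARATOR
        System.out.println(describe('A'));                            // not a space character
    }
}
```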

Upvotes: 5
