Reputation: 986
I have to write a Perl script that parses URIs from HTML. The real problem is how to resolve relative URIs.
I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (following RFC 3986), and various relative URIs:
/g, //g, ///g, ////g, h//g, g////h, h///g:f
Section 5.4.1 of this RFC only gives an example for //g:
"//g" = "http://g"
What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?
"///g" = "http://a/b/c///g"
Or what should it be? Can anyone explain this better and back it up with a non-obsolete RFC or documentation?
Update #1: Take a look at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577
What's going on here?
Upvotes: 0
Views: 558
Reputation: 385789
I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):
$ perl -MURI -e'
   for my $rel (qw( /g //g ///g ////g h//g g////h h///g:f )) {
      my $uri = URI->new($rel)->abs("http://a/b/c/d;p?q");
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         "http://a/b/c/d;p?q", $rel, $uri, $uri->host, $uri->path;
   }
   for my $base (qw( http://host/a/b/c/d http://host/a/b/c//d )) {
      my $uri = URI->new("../../e")->abs($base);
      printf "%-20s + %-7s = %-20s host: %-4s path: %s\n",
         $base, "../../e", $uri, $uri->host, $uri->path;
   }
'
http://a/b/c/d;p?q   + /g      = http://a/g           host: a    path: /g
http://a/b/c/d;p?q   + //g     = http://g             host: g    path:
http://a/b/c/d;p?q   + ///g    = http:///g            host:      path: /g
http://a/b/c/d;p?q   + ////g   = http:////g           host:      path: //g
http://a/b/c/d;p?q   + h//g    = http://a/b/c/h//g    host: a    path: /b/c/h//g
http://a/b/c/d;p?q   + g////h  = http://a/b/c/g////h  host: a    path: /b/c/g////h
http://a/b/c/d;p?q   + h///g:f = http://a/b/c/h///g:f host: a    path: /b/c/h///g:f
http://host/a/b/c/d  + ../../e = http://host/a/e      host: host path: /a/e
http://host/a/b/c//d + ../../e = http://host/a/b/e    host: host path: /a/b/e
Next, we'll look at the syntax of relative URIs, since that's what your question circles around.
relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

relative-part = "//" authority path-abempty
              / path-absolute
              / path-noscheme
              / path-empty

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )

segment       = *pchar         ; 0 or more <pchar>
segment-nz    = 1*pchar        ; 1 or more <pchar>   nz = non-zero
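As a rough illustration of these rules, the sketch below names which relative-part alternative a reference falls under by looking only at its leading slashes. The helper name relative_part_kind is made up for this example, and it assumes its input is already known to be a relative reference (no scheme), so it does not validate the characters inside each segment.

```perl
use strict;
use warnings;

# Hypothetical helper: name the relative-part alternative a reference uses.
# Assumption: the input has no scheme, i.e. it really is a relative reference.
sub relative_part_kind {
    my ($ref) = @_;
    $ref =~ s/[?#].*//s;    # drop query and fragment, keep the relative-part
    return 'authority + path-abempty' if $ref =~ m{^//};
    return 'path-absolute'            if $ref =~ m{^/};
    return 'path-empty'               if $ref eq '';
    return 'path-noscheme';
}

printf "%-8s %s\n", $_, relative_part_kind($_)
    for qw( /g //g ///g ////g h//g g////h h///g:f );
```

Anything that starts with two slashes lands in the first alternative, even when the authority between the second slash and the next one is empty.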
The key things from these rules for answering your question:
An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
// can otherwise occur in a path because segments can have zero length.
Now, let's look at each of the resolutions you provided in turn.
/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: undef
Base.path:      "/b/c/d;p"    R.path:      "/g"
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
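The component split above can be reproduced in a few lines of core Perl. This is only a sketch: parse_uri_ref and show are names invented here, and the pattern is the Appendix B expression rewritten with non-capturing groups so that the five captures line up with the five components.

```perl
use strict;
use warnings;

# Split a URI reference into scheme, authority, path, query and fragment.
# Optional groups that do not take part in the match come back as undef,
# which keeps "no authority" distinct from "empty authority".
sub parse_uri_ref {
    my ($ref) = @_;
    my ($scheme, $authority, $path, $query, $fragment) = $ref =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return { scheme => $scheme, authority => $authority, path => $path,
             query  => $query,  fragment  => $fragment };
}

# Render a component the way the tables in this answer do.
sub show { my ($v) = @_; return defined $v ? qq("$v") : 'undef' }

for my $ref ('http://a/b/c/d;p?q', '/g', '//g', '///g') {
    my $c = parse_uri_ref($ref);
    print "$ref\n";
    printf "  %-10s %s\n", "$_:", show($c->{$_})
        for qw( scheme authority path query fragment );
}
```

The distinction between undef and "" matters for the later cases: /g has no authority at all, while ///g has one that happens to be empty.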
Following the algorithm in §5.2.2, we get:
T.path:      "/g"      ; remove_dot_segments(R.path)
T.query:     undef     ; R.query
T.authority: "a"       ; Base.authority
T.scheme:    "http"    ; Base.scheme
T.fragment:  undef     ; R.fragment
Following the algorithm in §5.3, we get:
http://a/g
//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).
Instead, it follows this pattern:
"//" authority path-abempty
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: "g"
Base.path:      "/b/c/d;p"    R.path:      ""
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: "g"       ; R.authority
T.path:      ""        ; remove_dot_segments(R.path)
T.query:     undef     ; R.query
T.scheme:    "http"    ; Base.scheme
T.fragment:  undef     ; R.fragment
Following the algorithm in §5.3, we get the following:
http://g
Note: This contacts server g!
///g is similar to //g, except the authority is blank! Surprisingly, this is valid.
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: ""
Base.path:      "/b/c/d;p"    R.path:      "/g"
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.authority: ""        ; R.authority
T.path:      "/g"      ; remove_dot_segments(R.path)
T.query:     undef     ; R.query
T.scheme:    "http"    ; Base.scheme
T.fragment:  undef     ; R.fragment
Following the algorithm in §5.3, we get the following:
http:///g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
////g is the same as ///g except that R.path is //g, so we get
http:////g
Note: While valid, this URI is useless because the server name (T.authority) is blank!
The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Base.scheme:    "http"        R.scheme:    undef
Base.authority: "a"           R.authority: undef
Base.path:      "/b/c/d;p"    R.path:      "h//g"
Base.query:     "q"           R.query:     undef
Base.fragment:  undef         R.fragment:  undef
Following the algorithm in §5.2.2, we get the following:
T.path:      "/b/c/h//g"    ; remove_dot_segments(merge(Base.path, R.path))
T.query:     undef          ; R.query
T.authority: "a"            ; Base.authority
T.scheme:    "http"         ; Base.scheme
T.fragment:  undef          ; R.fragment
Following the algorithm in §5.3, we get the following:
http://a/b/c/h//g       # For h//g
http://a/b/c/g////h     # For g////h
http://a/b/c/h///g:f    # For h///g:f
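To tie the walk-throughs together, here is a minimal pure-Perl sketch of the §5.2.2 transform, the §5.2.3 merge, the §5.2.4 dot-segment removal and the §5.3 recomposition. It is not how the URI module is implemented; every sub name below is invented for this example, and it assumes Perl 5.10 or later for the // operator.

```perl
use v5.10;
use strict;
use warnings;

# Appendix B split, with non-capturing groups so captures map to components.
sub parse_uri_ref {
    my ($ref) = @_;
    my %c;
    @c{qw(scheme authority path query fragment)} = $ref =~
        m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return \%c;
}

# RFC 3986 section 5.2.4, steps A to E.
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})         { }                          # A
        elsif ($in =~ s{^/\.(?:/|\z)}{/})   { }                          # B
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/}) { $out =~ s{/?[^/]*\z}{} }   # C
        elsif ($in =~ s{^\.\.?\z}{})        { }                          # D
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }                     # E
    }
    return $out;
}

# RFC 3986 section 5.2.3: keep the base path up to its last "/".
sub merge_paths {
    my ($base, $rel_path) = @_;
    return "/$rel_path" if defined $base->{authority} && $base->{path} eq '';
    (my $dir = $base->{path}) =~ s{[^/]*\z}{};
    return $dir . $rel_path;
}

# RFC 3986 section 5.3: only defined components are emitted.
sub recompose {
    my ($t) = @_;
    my $s = '';
    $s .= "$t->{scheme}:"     if defined $t->{scheme};
    $s .= "//$t->{authority}" if defined $t->{authority};
    $s .= $t->{path};
    $s .= "?$t->{query}"      if defined $t->{query};
    $s .= "#$t->{fragment}"   if defined $t->{fragment};
    return $s;
}

# RFC 3986 section 5.2.2 (strict variant).
sub resolve {
    my ($rel, $base_uri) = @_;
    my ($r, $base) = (parse_uri_ref($rel), parse_uri_ref($base_uri));
    my %t = (fragment => $r->{fragment});
    if (defined $r->{scheme} or defined $r->{authority}) {
        # R brings its own authority (or is a full URI): base path is ignored.
        $t{scheme}    = $r->{scheme} // $base->{scheme};
        $t{authority} = $r->{authority};
        $t{path}      = remove_dot_segments($r->{path});
        $t{query}     = $r->{query};
    }
    else {
        @t{qw(scheme authority)} = @{$base}{qw(scheme authority)};
        if ($r->{path} eq '') {
            $t{path}  = $base->{path};
            $t{query} = $r->{query} // $base->{query};
        }
        else {
            my $path = $r->{path} =~ m{^/} ? $r->{path}
                                           : merge_paths($base, $r->{path});
            $t{path}  = remove_dot_segments($path);
            $t{query} = $r->{query};
        }
    }
    return recompose(\%t);
}

printf "%-8s => %s\n", $_, resolve($_, 'http://a/b/c/d;p?q')
    for qw( /g //g ///g ////g h//g g////h h///g:f );
```

The branch on defined $r->{authority} is what separates /g from ///g: an empty but defined authority replaces the base's authority, and recompose still emits the // for it.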
I don't think the examples are suitable for answering what I think you really want to know, though.
Take a look at the following two URIs. They aren't equivalent.
http://host/a/b/c/d # Path has 4 segments: "a", "b", "c", "d"
and
http://host/a/b/c//d # Path has 5 segments: "a", "b", "c", "", "d"
Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if these were the base URI for ../../e, you'd get
http://host/a/b/c/d + ../../e = http://host/a/e
and
http://host/a/b/c//d + ../../e = http://host/a/b/e
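The segment counts behind that difference are easy to check. The sketch below uses a made-up helper, path_segments, that simply splits an absolute path on /; the -1 limit keeps split from discarding trailing empty segments.

```perl
use strict;
use warnings;

# Hypothetical helper: list the segments of an absolute path,
# keeping empty segments such as the one between "//".
sub path_segments {
    my ($path) = @_;
    my @seg = split m{/}, $path, -1;
    shift @seg;    # drop the empty field in front of the leading "/"
    return @seg;
}

for my $path ('/a/b/c/d', '/a/b/c//d') {
    my @seg = path_segments($path);
    printf "%-10s %d segments: %s\n", $path, scalar @seg,
        join ', ', map qq("$_"), @seg;
}
```

Each .. in a relative reference removes one segment, empty or not, which is why the extra empty segment in the second base leaves b in the result.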
Upvotes: 5
Reputation: 132802
I was curious what Mojo::URL would do so I checked. There's a big caveat because it doesn't claim to be strictly compliant:
Mojo::URL implements a subset of RFC 3986, RFC 3987 and the URL Living Standard for Uniform Resource Locators with support for IDNA and IRIs.
Here's the program.
use v5.10;       # enables say
use strict;
use warnings;
use Mojo::URL;

my @urls = qw(/g //g ///g ////g h//g g////h h///g:f
    https:///stackoverflow.com////////a/////10161264/////6618577
);
my @parts = qw(scheme host port path query);
my $template = join "\n", map { "$_: %s" } @parts;
my $base_url = Mojo::URL->new( 'http://a/b/c/d;p?q' );
foreach my $u ( @urls ) {
    my $url = Mojo::URL->new( $u )->base( $base_url )->to_abs;
    no warnings qw(uninitialized);   # some parts are undef for some URLs
    say '-' x 40;
    printf "%s\n$template", $u, map { $url->$_() } @parts;
}
Here's the output:
----------------------------------------
/g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
//g
scheme: http
host: g
port:
path:
query: ----------------------------------------
///g
scheme: http
host: a
port:
path: /g
query: ----------------------------------------
////g
scheme: http
host: a
port:
path: //g
query: ----------------------------------------
h//g
scheme: http
host: a
port:
path: /b/c/h/g
query: ----------------------------------------
g////h
scheme: http
host: a
port:
path: /b/c/g/h
query: ----------------------------------------
h///g:f
scheme: http
host: a
port:
path: /b/c/h/g:f
query: ----------------------------------------
https:///stackoverflow.com////////a/////10161264/////6618577
scheme: https
host:
port:
path: /stackoverflow.com////////a/////10161264/////6618577
query:
Upvotes: 1