Reputation:
I have the following regex:
http://([^:]*):?([0-9]*)(/.*)
When I match that against http://brandonhsiao.com/essays/showers.html, the parentheses grab http://brandonhsiao.com/essays and /showers.html. How can I get it to grab http://brandonhsiao.com and /essays/showers.html?
Upvotes: 2
Views: 146
Reputation: 6552
Put a question mark after the first * to make it non-greedy. Right now the part of your pattern that matches the hostname grabs everything all the way up to the last /.
http://([^:]*?):?([0-9]*)(/.*)
But that's not even what I would recommend. Try this instead:
(http://[^\s/]+)([^\s?#]*)
$1 should have http://brandonhsiao.com and $2 should have /essays/showers.html; any hash or query string is ignored.
Note that this is not designed to validate a URL, just to divide a URL up into the portion before the path, and the path itself. For example, it would happily accept invalid characters as part of the hostname. However, it does work fine for URLs with or without paths.
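To see how this pattern treats a few shapes of URL, here is a minimal Python sketch. Python's re module is only a stand-in engine here, and the helper name split_url and the URL with a port and query string are made up for illustration.

```python
import re

# Same pattern as above: group 1 is everything before the path, group 2 is the path.
SPLIT_URL = re.compile(r"(http://[^\s/]+)([^\s?#]*)")

def split_url(url):
    """Return (prefix, path), or None if the pattern does not match."""
    m = SPLIT_URL.match(url)
    return (m.group(1), m.group(2)) if m else None

print(split_url("http://brandonhsiao.com/essays/showers.html"))
print(split_url("http://brandonhsiao.com"))                   # no path: group 2 is empty
print(split_url("http://brandonhsiao.com:8080/a/b?x=1#top"))  # hypothetical URL
```

Note that a port, if present, stays in the first group with this pattern, because [^\s/]+ also matches the colon and the digits.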
P.S. I don't know exactly what you are doing with this in Lisp, so I have taken the liberty of only testing it in other PCRE-compatible environments. Usually I test my answers in the exact context where they will be used.
$_ = "http://brandonhsiao.com/essays/showers.html";
m|(http://[^\s/]+)([^\s?#]*)|;
print "1 = '$1' and 2 = '$2'\n";
# [j@5 ~]$ perl test2.pl
# 1 = 'http://brandonhsiao.com' and 2 = '/essays/showers.html'
Upvotes: 3
Reputation: 718
http:\/\/([^:]*?)(\/.*)
The *? is a non-greedy match up to the first slash (the one just after .com).
See http://rubular.com/r/VmU2ghAX0k for match groups
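The difference between * and *? can also be checked with a short Python sketch. It assumes Python's re module as the engine; the slashes are left unescaped because escaping them only matters when the pattern is written between / delimiters. The example.com URL with a port is hypothetical.

```python
import re

URL = "http://brandonhsiao.com/essays/showers.html"

# Greedy: [^:]* takes as much as it can, then backs off only to the LAST slash.
greedy = re.match(r"http://([^:]*)(/.*)", URL)
# Lazy: [^:]*? takes as little as it can, so it stops before the FIRST slash.
lazy = re.match(r"http://([^:]*?)(/.*)", URL)

print(greedy.groups())  # ('brandonhsiao.com/essays', '/showers.html')
print(lazy.groups())    # ('brandonhsiao.com', '/essays/showers.html')

# [^:] cannot cross a colon, so this two-group pattern does not match a URL with a port.
with_port = re.match(r"http://([^:]*?)(/.*)", "http://example.com:8080/x")
print(with_port)        # None
```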
Upvotes: 0
Reputation: 3077
http://([^/:]*):?([0-9]*)(/.*)
The first group originally matched everything but :, and I have now added / to it. That works because [^...] is a negated character class: it matches any character except the ones listed inside the brackets. Everything else is the same.
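A quick way to check the three groups, sketched in Python as a stand-in engine. The helper name parts and the URL with port 8080 are assumptions added for illustration.

```python
import re

# Host stops at the first "/" or ":", then an optional port, then the path.
PATTERN = re.compile(r"http://([^/:]*):?([0-9]*)(/.*)")

def parts(url):
    """Return (host, port, path); raises AttributeError if the URL does not match."""
    return PATTERN.match(url).groups()

print(parts("http://brandonhsiao.com/essays/showers.html"))
# ('brandonhsiao.com', '', '/essays/showers.html')
print(parts("http://brandonhsiao.com:8080/essays/showers.html"))
# ('brandonhsiao.com', '8080', '/essays/showers.html')
```

Because the class excludes / as well as :, no non-greedy quantifier is needed: the first group simply cannot run past the hostname.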
Hope it helped!
Upvotes: 0