Reputation: 1849
I am developing an HTTP robot, and I wrote this regular expression
(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)
to detect and extract the hostname from a link (href).
Here are the results of my tests:
String -> http://www.meloteca.com/empresas-editoras.htm
Returns http://www.meloteca.com
String -> www.meloteca.com/empresas-editoras.htm
Returns www.meloteca.com
String -> /empresas-editoras.htm
Returns empresas-editoras.htm (without the slash)
In this case I expected the regular expression not to return any value. Why is this happening? The same thing happens with the following string:
String -> empresas-editoras.htm
Returns empresas-editoras.htm
The code snippet:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern padrao = Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");
Matcher mat = padrao.matcher("empresas-editoras.htm");
if (mat.find()) {
    System.out.println("Host->" + mat.group());
}
Upvotes: 1
Views: 991
Reputation: 2618
The alternative ((?:f|ht)tp(?:s)?\\://)? is optional, so it can match the empty string, and then ([^/]+) simply matches any run of characters that does not contain /.
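To see the empty match directly, here is a minimal sketch that prints the capture groups of the original pattern. The class name OptionalPrefixDemo and the helper groups are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalPrefixDemo {
    // The pattern exactly as posted in the question.
    static final Pattern ORIGINAL =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");

    // Hypothetical helper: returns {group 1 (prefix), group 3 (rest)} for the
    // first match, or null when nothing matches.
    static String[] groups(String link) {
        Matcher m = ORIGINAL.matcher(link);
        return m.find() ? new String[] { m.group(1), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "http://www.meloteca.com/empresas-editoras.htm",
            "www.meloteca.com/empresas-editoras.htm",
            "/empresas-editoras.htm",
            "empresas-editoras.htm"
        };
        for (String s : inputs) {
            String[] g = groups(s);
            System.out.println(s + " -> prefix=[" + g[0] + "] rest=[" + g[1] + "]");
        }
    }
}
```

For every input except the first, the prefix group comes back empty. That includes the www link: the optional scheme alternative is tried first and succeeds with the empty string, so the www branch is never needed and ([^/]+) consumes www.meloteca.com by itself.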
Upvotes: 0
Reputation: 4165
If you remove one of the question marks, like this:
(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)
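it should work better: the scheme or www prefix is then required, so a bare path such as empresas-editoras.htm no longer matches.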
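As a rough check, this sketch runs the four test strings from the question against the pattern with the question mark removed. The class name StrictHostPattern and the helper extract are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictHostPattern {
    // Same pattern as the question, minus the '?' after the scheme group,
    // so a scheme or a leading "www" must be present.
    static final Pattern HOST =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)");

    // Hypothetical helper: returns the matched text, or null when the input
    // contains neither a scheme nor "www".
    static String extract(String link) {
        Matcher m = HOST.matcher(link);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.meloteca.com/empresas-editoras.htm"));
        System.out.println(extract("www.meloteca.com/empresas-editoras.htm"));
        System.out.println(extract("/empresas-editoras.htm"));
        System.out.println(extract("empresas-editoras.htm"));
    }
}
```

The first two calls return the same values as in the question, and the last two return null. Note that find() scans the whole input, so a relative path that happens to contain www (for example docs/www-notes.htm) would still produce a match; anchoring the pattern with ^ is one possible way to rule that out.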
Upvotes: 1
Reputation: 34581
It would be better to use the URI class and its methods, such as getHost() and getPath(), rather than a regular expression. The rules for constructing URIs are more complex than you probably realize, and your regex is likely to have many corner cases that it won't handle correctly.
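A minimal sketch of that approach with java.net.URI follows. The class name UriHostDemo and the helper hostOf are illustrative, and treating a syntax error as "no host" is an assumption about what the robot wants.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriHostDemo {
    // Hypothetical helper: returns the host component, or null for relative
    // references and for strings that are not valid URIs (an assumed policy).
    static String hostOf(String link) {
        try {
            return new URI(link).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("http://www.meloteca.com/empresas-editoras.htm");
        System.out.println("Host->" + uri.getHost());
        System.out.println("Path->" + uri.getPath());

        // Relative links have no host component.
        System.out.println(hostOf("/empresas-editoras.htm"));
        System.out.println(hostOf("empresas-editoras.htm"));

        // Without a scheme, this is parsed as a relative path, not a host.
        System.out.println(hostOf("www.meloteca.com/empresas-editoras.htm"));
    }
}
```

One difference from the regex: a scheme-less link such as www.meloteca.com/empresas-editoras.htm is a relative reference under the URI rules, so getHost() returns null for it. If the robot should treat such links as hosts, it would need to add a scheme before parsing.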
Upvotes: 3