tt0686

Reputation: 1849

Regular expression Hostname

I am developing an HTTP robot, and I wrote this regular expression, (((?:f|ht)tp(?:s)?\\://)?|www)([^/]+), to detect and extract the hostname from a link (href). Here are the results of my tests:

String -> http://www.meloteca.com/empresas-editoras.htm   
Returns   http://www.meloteca.com
String -> www.meloteca.com/empresas-editoras.htm    
Returns   www.meloteca.com
String -> /empresas-editoras.htm   
Returns   empresas-editoras.htm (without the slash)

In this case I expected the regular expression not to return any value. Why is this happening? The same thing happens if I try the following string:

String -> empresas-editoras.htm   
Returns   empresas-editoras.htm

The code snippet:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern padrao = Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");
Matcher mat = padrao.matcher("empresas-editoras.htm");
if (mat.find()) {
    System.out.println("Host->" + mat.group());
}

Upvotes: 1

Views: 991

Answers (3)

benroth

Reputation: 2618

The first alternative, ((?:f|ht)tp(?:s)?\\://)?, is optional, so it can match the empty string. When it does, ([^/]+) simply matches any run of characters that does not contain /.
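
To see the empty match directly, here is a small sketch that prints where the match starts and what each group captured. The class name EmptyPrefixDemo and the helper describe are made up for illustration; the pattern is the one from the question, unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyPrefixDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");

    // Reports the first match: its start offset, the prefix group (1)
    // and the remainder group (3).
    static String describe(String input) {
        Matcher m = ORIGINAL.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        return "start=" + m.start()
                + " prefix='" + m.group(1) + "'"
                + " rest='" + m.group(3) + "'";
    }

    public static void main(String[] args) {
        // prints: start=0 prefix='' rest='empresas-editoras.htm'
        System.out.println(describe("empresas-editoras.htm"));
        // prints: start=1 prefix='' rest='empresas-editoras.htm'
        System.out.println(describe("/empresas-editoras.htm"));
    }
}
```

In the second case nothing can match at offset 0, because ([^/]+) needs at least one character that is not a slash. find() therefore moves on to offset 1, where the empty prefix succeeds again, which is why the result comes back without the leading slash.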

Upvotes: 0

Petter

Reputation: 4165

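If you remove the question mark that makes the scheme group optional, like this:

(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)

the prefix becomes mandatory, so strings that start with neither a scheme nor www should no longer match.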
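
A quick check of that pattern against the four strings from the question might look like this. It is only a sketch: the class name RequiredPrefixDemo and the helper extract are invented here, and null is used to mean "no match".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RequiredPrefixDemo {
    // Same pattern as in the question, minus the '?' after the scheme group.
    static final Pattern REQUIRED_PREFIX =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)");

    // Returns the matched text, or null when the link has no scheme or www prefix.
    static String extract(String link) {
        Matcher m = REQUIRED_PREFIX.matcher(link);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] links = {
            "http://www.meloteca.com/empresas-editoras.htm",
            "www.meloteca.com/empresas-editoras.htm",
            "/empresas-editoras.htm",
            "empresas-editoras.htm"
        };
        for (String link : links) {
            System.out.println(link + " -> " + extract(link));
        }
    }
}
```

The first two links still return http://www.meloteca.com and www.meloteca.com, and the two relative ones return null. Note that group() still includes the scheme; group(3) holds only the part after it.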

Upvotes: 1

Wyzard

Reputation: 34581

It'd be better to use the java.net.URI class, and its methods like getHost() and getPath(), rather than a regular expression. The rules for constructing URIs are more complex than you probably realize, and your regex is likely to have lots of corner cases that won't be handled correctly.
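
A minimal sketch of that approach, assuming the robot knows the URI of the page each href came from. The class name UriHostDemo, the helpers hostOf and hostRelativeTo, and the base page address are all illustrative.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriHostDemo {
    // Host of the link itself, or null if it has none (relative or malformed).
    static String hostOf(String link) {
        try {
            return new URI(link).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Host after resolving the link against the page it was found on.
    static String hostRelativeTo(URI base, String link) {
        try {
            return base.resolve(new URI(link)).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical page the links were scraped from.
        URI base = URI.create("http://www.meloteca.com/index.htm");
        System.out.println(hostOf("http://www.meloteca.com/empresas-editoras.htm"));
        System.out.println(hostOf("/empresas-editoras.htm"));
        System.out.println(hostRelativeTo(base, "/empresas-editoras.htm"));
    }
}
```

Here getHost() returns www.meloteca.com for the absolute link and null for the relative ones. A link such as www.meloteca.com/empresas-editoras.htm has no scheme, so URI treats it as a relative path and also reports no host, which matches how a browser would interpret that href.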

Upvotes: 3
