Reputation: 1849

Regular expression Hostname

I am developing an HTTP robot, and I wrote this regular expression (((?:f|ht)tp(?:s)?\\://)?|www)([^/]+) to detect and extract the hostname from a link (href). Here are the results of my tests:

String -> http://www.meloteca.com/empresas-editoras.htm   
Returns   http://www.meloteca.com

String -> www.meloteca.com/empresas-editoras.htm    
Returns   www.meloteca.com

String -> /empresas-editoras.htm   
Returns   empresas-editoras.htm (without the slash)

In this case I expected the regular expression not to return any value. Why is this happening? The same thing happens if I try the following string:

String -> empresas-editoras.htm   
Returns   empresas-editoras.htm

The code snippet:

    Pattern padrao = Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");
    Matcher mat = padrao.matcher("empresas-editoras.htm");
    if (mat.find()) {
        System.out.println("Host->" + mat.group());
    }

Upvotes: 1

Answers (3)

benroth

Reputation: 2618

The first alternative, ((?:f|ht)tp(?:s)?\\://)?, is optional, so it can match the empty string. After that, ([^/]+) simply matches any run of characters that does not contain /.
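
To make this visible, here is a small sketch that runs the pattern from the question and reports the whole match together with groups 1 and 3. The class name OptionalPrefixDemo and the helper firstMatch are made up for illustration; only the pattern comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
final class OptionalPrefixDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");

    // Returns {whole match, group 1 (prefix), group 3 (rest)},
    // or null when find() locates no match at all.
    static String[] firstMatch(String input) {
        Matcher m = ORIGINAL.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1), m.group(3) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.meloteca.com/empresas-editoras.htm",
            "/empresas-editoras.htm",
            "empresas-editoras.htm"
        };
        for (String s : samples) {
            String[] r = firstMatch(s);
            System.out.println(s + " -> match=[" + r[0]
                    + "] prefix=[" + r[1] + "] rest=[" + r[2] + "]");
        }
    }
}
```

For the two relative links the prefix group comes back as an empty string, so the whole match is just the first run of non-slash characters. With "/empresas-editoras.htm" the attempt at index 0 fails on the slash, and find() then retries at index 1, which is why the slash is missing from the result.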

Upvotes: 0

Petter

Reputation: 4165

If you remove one of the question marks, like this:

(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)

it should work better.
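
A quick way to check this is to run the modified pattern against the four test strings from the question. This is only a sketch: the class name RequiredPrefixDemo and the helper hostPart are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
final class RequiredPrefixDemo {
    // Same pattern as in the question, minus the '?' that made the
    // scheme prefix optional. A scheme or "www" is now required.
    static final Pattern FIXED =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)");

    // Returns the whole match, or null when the link has no
    // scheme or "www" prefix anywhere in it.
    static String hostPart(String href) {
        Matcher m = FIXED.matcher(href);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.meloteca.com/empresas-editoras.htm",
            "www.meloteca.com/empresas-editoras.htm",
            "/empresas-editoras.htm",
            "empresas-editoras.htm"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + hostPart(s));
        }
    }
}
```

The two relative links now produce no match. Two caveats: for the "www" case group 3 holds only ".meloteca.com", so the whole match (group 0) is the useful value there, and because find() is not anchored, a relative path that happens to contain "www" would still match.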

Upvotes: 1

Wyzard

Reputation: 34581

It'd be better to use the URI class, and its methods like getHost() and getPath(), rather than a regular expression. The rules for constructing URIs are more complex than you probably realize, and your regex is likely to have lots of corner cases that won't be handled correctly.
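
As a rough sketch of that approach, the following uses java.net.URI to pull the host out of a link. The class name UriHostDemo and the helper hostOf are made up for the example, and returning null for unparsable input is just one possible choice.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of the original code.
final class UriHostDemo {
    // Returns the host component of the link, or null when the link
    // has no host (for example a relative path) or cannot be parsed.
    static String hostOf(String href) {
        try {
            return new URI(href).getHost();
        } catch (URISyntaxException e) {
            return null; // assumption: treat malformed links as "no host"
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.meloteca.com/empresas-editoras.htm",
            "www.meloteca.com/empresas-editoras.htm",
            "/empresas-editoras.htm",
            "empresas-editoras.htm"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + hostOf(s));
        }
    }
}
```

Only the first link yields a host here. A link without a scheme, such as www.meloteca.com/empresas-editoras.htm, is parsed as a relative path, so getHost() returns null for it. A robot would probably want to resolve relative hrefs against the URL of the page they came from, for instance with URI.resolve, before asking for the host.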

Upvotes: 3
