tsilb

Reputation: 8037

How can I make this regex match correctly?

Given this regex:

^((https?|ftp):(\/{2}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}
|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1})

Reformatted for readability:

@"^((https?|ftp):(\/{2}))?" + // http://, https://, ftp:// - Protocol Optional
@"(" + // Begin URL payload format section
@"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)" + // IPv4 Address support
@")|("+ // Delimit supported payload types
@"((([a-zA-Z0-9]+)(\.)*?))(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum){1}" + // FQDNs
@")"; // End URL payload format section

How can I make it fail (i.e. not match) on this "fail" test case?

http://www.google

As I am specifying {1} on the TLD section, I would think it would fail without the extension. Am I wrong?

Edit: These are my PASS conditions:

These are my FAIL conditions:

Upvotes: 1

Views: 266

Answers (5)

Sedecimdies

Reputation: 152

It's all about definitions: a "valid URL" should give you an IP address when you do a DNS lookup. You should be able to connect to that IP, and when you send a request you should get a reply, such as HTML, that you can use.

So what we are looking for is a "valid URL format", and that is where System.Uri comes in very handy. But if the URL is hidden in a large piece of text, you first need to find something that could validate as a valid URL format.

The thing that distinguishes a URL from ordinary readable text is a dot that is not followed by whitespace. "123.com" could validate as a real URL.

Use the regex

[a-z_\.\-0-9]+\.[a-z]+[^ ]*

to find any possible valid URL in a text, then do a System.Uri check to see if it is a valid URL format, and then do a lookup. Only when the lookup gives you a result do you know the URL is valid.
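
A rough sketch of those three steps in C#, in case it helps. The sample text, the class name, and the decision to prepend "http://" are my own assumptions: the regex above cannot match a scheme (':' and '/' are not in its leading character class), so a bare candidate such as "123.com" needs a scheme before System.Uri will treat it as absolute. The lookup step needs network access, so results depend on your environment.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text.RegularExpressions;

class UrlFinder
{
    static void Main()
    {
        // Hypothetical input text, for illustration only.
        string text = "Docs live at example.com/docs and nothing else here is a link.";

        // Step 1: find candidates with the regex from this answer.
        // IgnoreCase is an assumption so that upper-case hosts are found too.
        MatchCollection candidates = Regex.Matches(
            text, @"[a-z_\.\-0-9]+\.[a-z]+[^ ]*", RegexOptions.IgnoreCase);

        foreach (Match m in candidates)
        {
            // Assumption: candidates never carry a scheme, so add one.
            string candidate = "http://" + m.Value;

            // Step 2: format check with System.Uri.
            if (!Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri))
            {
                Console.WriteLine(m.Value + ": not a valid URL format");
                continue;
            }

            // Step 3: DNS lookup; a failed lookup surfaces as SocketException.
            try
            {
                IPAddress[] addresses = Dns.GetHostAddresses(uri.Host);
                Console.WriteLine(uri.Host + ": resolved to " + addresses.Length + " address(es)");
            }
            catch (SocketException)
            {
                Console.WriteLine(uri.Host + ": valid format, but lookup failed");
            }
        }
    }
}
```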

Upvotes: 0

Zano

Reputation: 2761

Sometimes, one catch-all regex is not the best solution, however tempting. While debugging this regex is feasible (see Greg Hewgill's answer), consider doing a couple of tests for different categories of problems, e.g. one test for numerical addresses and one test for named addresses.
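
Something like the following is what I have in mind; treat it as a sketch, not a finished validator. The class and method names are made up for illustration, the TLD list is copied from the question, and writing the labels as `([a-zA-Z0-9]+\.)+` is my assumption about what the FQDN part was meant to allow. It checks a bare host, so the protocol would have to be stripped first.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class HostChecks
{
    // Numeric test: exactly four dot-separated decimal octets, each 0..255.
    public static bool IsIPv4(string host)
    {
        string[] parts = host.Split('.');
        return parts.Length == 4 && parts.All(IsOctet);
    }

    static bool IsOctet(string part)
    {
        if (part.Length < 1 || part.Length > 3) return false;
        foreach (char c in part)
            if (c < '0' || c > '9') return false;
        return int.Parse(part) <= 255;
    }

    // Named test: one or more labels, then a TLD from the question's list,
    // anchored at both ends so a partial match cannot pass.
    static readonly Regex NamedHost = new Regex(
        @"^([a-zA-Z0-9]+\.)+([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$");

    public static bool IsNamedHost(string host) => NamedHost.IsMatch(host);

    static void Main()
    {
        foreach (string host in new[] { "192.168.0.1", "256.1.1.1", "www.google.com", "www.google" })
            Console.WriteLine($"{host}: ipv4={IsIPv4(host)}, named={IsNamedHost(host)}");
    }
}
```

With these two tests, "192.168.0.1" passes only the numeric check, "www.google.com" passes only the named check, and "256.1.1.1" and "www.google" pass neither.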

Upvotes: 3

bobbymcr

Reputation: 24167

I'll throw out an alternative suggestion. You may want to use a combination of the parsing of the built-in System.Uri class and a couple of targeted regexes (or simple string checks when appropriate).

Example:

string uriString = "...";

Uri uri;
if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri))
{
    // Uri is totally invalid!
}
else
{
    // validate the scheme
    if (!uri.Scheme.Equals("http", StringComparison.OrdinalIgnoreCase))
    {
        // not http!
    }

    // validate the authority ('www.blah.com:1234' portion)
    if (uri.Authority // ...)
    {
    }

    // ...
}
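
To make the authority step a bit more concrete, here is one possible way to fill it in. This is only a sketch under my own assumptions: the helper name is hypothetical, the accepted schemes and the TLD list are taken from the question, and unlike the question's regex it requires the protocol to be present because it parses with UriKind.Absolute.

```csharp
using System;
using System.Text.RegularExpressions;

static class UriChecks
{
    // TLD list from the question; anchoring it to the end of the host is this sketch's choice.
    static readonly Regex EndsWithKnownTld = new Regex(
        @"\.([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)$");

    public static bool IsAcceptable(string uriString)
    {
        // Let System.Uri do the general parsing.
        if (!Uri.TryCreate(uriString, UriKind.Absolute, out Uri uri))
            return false;

        // Uri.Scheme is already lower-cased by the parser.
        if (uri.Scheme != Uri.UriSchemeHttp &&
            uri.Scheme != Uri.UriSchemeHttps &&
            uri.Scheme != Uri.UriSchemeFtp)
            return false;

        // Targeted check on the host part only.
        switch (uri.HostNameType)
        {
            case UriHostNameType.IPv4:
                return true;
            case UriHostNameType.Dns:
                return EndsWithKnownTld.IsMatch(uri.Host);
            default:
                return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsAcceptable("http://www.google.com")); // True
        Console.WriteLine(IsAcceptable("http://www.google"));     // False
        Console.WriteLine(IsAcceptable("http://192.168.0.1"));    // True
    }
}
```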

Upvotes: 4

Zano

Reputation: 2761

The "validate a url" problem has been solved* numerous times. I suggest you use the System.Uri class; it validates more cases than you can shake a stick at.

The code Uri uri = new Uri("http://whatever"); throws a UriFormatException if it fails validation. That is probably what you'd want.

*) Or kind of solved. It's actually pretty tricky to define what is a valid url.
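
A small sketch of the exception-based check, with a hypothetical helper name. One caveat worth seeing for yourself: System.Uri only checks that the string is a well-formed URI, so the question's "fail" case http://www.google is accepted, and any TLD rule would have to be layered on top.

```csharp
using System;

class UriCtorDemo
{
    // Returns true when the Uri constructor accepts the string.
    static bool IsWellFormed(string candidate)
    {
        try
        {
            new Uri(candidate);
            return true;
        }
        catch (UriFormatException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsWellFormed("http://www.google.com")); // True
        Console.WriteLine(IsWellFormed("http://www.google"));     // True: well-formed, no TLD rule applied
        Console.WriteLine(IsWellFormed("not a url"));              // False
    }
}
```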

Upvotes: 1

Greg Hewgill

Reputation: 993163

You need to force your regex to match up until the end of the string. Add a $ at the very end of it. Otherwise, your regex is probably just matching http://, or something else shorter than your whole string.
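
Here is a quick sketch you can run to see the effect. The pattern pieces are copied from the question; the `grouped` variant is my own assumption about what was intended. It wraps both payload alternatives in one group so that `^` and `$` apply to each of them, and it writes the FQDN labels as `([a-zA-Z0-9]+\.)+` so multi-label hosts still match once both ends are anchored.

```csharp
using System;
using System.Text.RegularExpressions;

class AnchorDemo
{
    // Pieces copied from the question.
    const string Protocol = @"^((https?|ftp):(\/{2}))?";
    const string Ipv4 = @"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
    const string Tld = @"([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)";

    static void Main()
    {
        // The question's pattern, unchanged.
        string original = Protocol + "(" + Ipv4 + ")|(" + @"((([a-zA-Z0-9]+)(\.)*?))(\.)" + Tld + "{1})";

        // Same pattern with $ appended.
        string withDollar = original + "$";

        // Assumption: group the alternatives so both anchors cover both branches.
        string grouped = Protocol + "(?:" + Ipv4 + @"|([a-zA-Z0-9]+\.)+" + Tld + ")$";

        // Without $, the unanchored second branch matches a fragment of the input.
        Console.WriteLine(Regex.Match("http://www.google", original).Value); // www.go

        Console.WriteLine(Regex.IsMatch("http://www.google", withDollar));   // False
        Console.WriteLine(Regex.IsMatch("!!!google.com", withDollar));       // True: start is still unanchored

        Console.WriteLine(Regex.IsMatch("http://www.google", grouped));      // False
        Console.WriteLine(Regex.IsMatch("http://www.google.com", grouped));  // True
        Console.WriteLine(Regex.IsMatch("!!!google.com", grouped));          // False
    }
}
```

The first line of output shows why `{1}` on the TLD did not help: "go" satisfies `[a-z]{2}`, so the pattern is happy with "www.go" and ignores the rest of the string.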

Upvotes: 2
