Rohit Chopra

Reputation: 2811

PHP URL validation

I know there are countless threads asking this question, but I have not been able to find one that helps with my case.

I am trying to parse a list of around 10,000,000 URLs, check that each one is valid per the criteria below, and then extract the root domain. The list contains just about everything you can imagine, including entries like these (shown with the expected result for each):

bit.ly/test [VALID] [return - bit.ly]
example.com/apples?test=1&id=4 [VALID] [return - example.com]
host101.wow404.apples.test.com/cert/blah [VALID] [return - test.com]
101.121.44.xxx [INVALID] [return false]
localhost/noway [INVALID] [return false]
www.awesome.com [VALID] [return - awesome.com]
i am so awesome [INVALID] [return false]
http://404.mynewsite.com/visits/page/view/1/ [VALID] [return - mynewsite.com]
www1.151.com/searchresults [VALID] [return - 151.com]

Does anyone have any suggestions for this?

Upvotes: 0

Views: 6721

Answers (4)

Swadesh

Reputation: 1

// test_input() is a sanitising helper that is not shown in this answer
$website = test_input($_POST["website"]);
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
  {
  $websiteErr = "Invalid URL";
  }

Upvotes: 0

dqhendricks

Reputation: 19251

I would start with the default:

filter_var($inputUrl, FILTER_VALIDATE_URL);

Then add further checks for your special cases that should not be accepted. This should simplify things a bit.

As for getting the host:

parse_url($inputUrl, PHP_URL_HOST);
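
A minimal sketch of how the two calls might be combined. Note that FILTER_VALIDATE_URL rejects URLs that have no scheme, so entries such as example.com/apples fail unless one is added first. Prepending http:// is an assumption on my part, and hostFromUrl is a hypothetical helper name. It returns the full host rather than the root domain, and special cases such as localhost still need their own checks.

```php
<?php
// Hypothetical helper: returns the lowercased host of $inputUrl, or false if it does not validate.
function hostFromUrl(string $inputUrl)
{
    $url = trim($inputUrl);

    // Assumption: an input without a scheme is treated as http.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    // Default validation first; special cases can be rejected after this point.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : false;
}
```

With this, hostFromUrl('http://404.mynewsite.com/visits/page/view/1/') gives 404.mynewsite.com, so reducing the host to its root domain remains a separate step.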

Upvotes: 2

Tomalak

Reputation: 338108

^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)

Explanation

^                # start-of-line
(?:              # begin non-capturing group
  https?         #   "http" or "https"
  ://            #   "://"
)?               # end non-capturing group, make optional
(?:              # start non-capturing group
  [a-z0-9-]+\.   #   a name part (numbers, ASCII letters, dashes) & a dot
)*               # end non-capturing group, match as often as possible
(                # begin group 1 (this will be the domain name)
  (?:            #   start non-capturing group
    [a-z0-9-]+\. #     a name part, same as above
  )              #   end non-capturing group
  [a-z]+         #   the TLD
)                # end group 1 

http://rubular.com/r/g6s9bQpNnC
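
In PHP the pattern could be applied roughly like this. It is only a sketch: rootDomain is a hypothetical helper name, and the # delimiters and the i flag are additions so that :// needs no escaping and uppercase input still matches. Taking the last two labels as the root domain is the pattern's own simplification, so it yields co.uk for example.co.uk, and it does not reject inputs such as 101.121.44.xxx, which would need a separate check.

```php
<?php
// Hypothetical helper: returns the last two labels of the host, or false when the pattern does not match.
function rootDomain(string $url)
{
    // Same pattern as above, wrapped in # delimiters and made case-insensitive.
    $pattern = '#^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)#i';

    if (preg_match($pattern, trim($url), $matches) === 1) {
        // Group 1 holds "name.tld".
        return strtolower($matches[1]);
    }
    return false;
}
```

For the samples in the question, rootDomain('www1.151.com/searchresults') gives 151.com and rootDomain('localhost/noway') gives false.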

Upvotes: 15

JNF

Reputation: 3730

^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$

edit

In PHP that would be preg_match('#^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$#', $myUrl, $matches), where $myUrl is a single URL string, so call it once per entry in your list.

What you need would be in $matches[1]

Upvotes: 0
