Reputation: 2811
I know there are an infinite number of threads asking this question, but I have not been able to find one that can help me with this.
I am trying to parse a list of around 10,000,000 URLs, check that each one is valid per the criteria below, and then extract its root domain. The list contains just about everything you can imagine, including entries like these (each shown with the expected result):
bit.ly/test [VALID] [return - bit.ly]
example.com/apples?test=1&id=4 [VALID] [return - example.com]
host101.wow404.apples.test.com/cert/blah [VALID] [return - test.com]
101.121.44.xxx [**inVALID**] [return false]
localhost/noway [**inVALID**] [return false]
www.awesome.com [VALID] [return - awesome.com]
i am so awesome [**inVALID**] [return false]
http://404.mynewsite.com/visits/page/view/1/ [VALID] [return - mynewsite.com]
www1.151.com/searchresults [VALID] [return - 151.com]
Does anyone have any suggestions for this?
Upvotes: 0
Views: 6721
Reputation: 1
$website = test_input($_POST["website"]);
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
{
$websiteErr = "Invalid URL";
}
Upvotes: 0
Reputation: 19251
I would start with the default:
filter_var($inputUrl, FILTER_VALIDATE_URL);
Then add further checks for the special cases you want to reject. That should simplify things a bit.
As for getting the host:
parse_url($inputUrl, PHP_URL_HOST);
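A minimal sketch of how the two calls might be combined. It assumes that scheme-less input should get http:// prepended, because FILTER_VALIDATE_URL rejects strings without a scheme and parse_url() would otherwise treat the host as part of the path. The function name extract_host is hypothetical.

```php
<?php
// Hypothetical helper: returns the lowercased host of $inputUrl,
// or false if the URL does not pass FILTER_VALIDATE_URL.
function extract_host(string $inputUrl)
{
    $url = trim($inputUrl);

    // Assumption: input without a scheme is treated as plain http.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $url)) {
        $url = 'http://' . $url;
    }

    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? strtolower($host) : false;
}
```

This returns the full host (for example www.awesome.com), not the root domain, and it accepts hosts such as localhost, so those special cases still need their own checks.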
Upvotes: 2
Reputation: 338108
^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)
Explanation
^ # start-of-line
(?: # begin non-capturing group
https? # "http" or "https"
:// # "://"
)? # end non-capturing group, make optional
(?: # start non-capturing group
[a-z0-9-]+\. # a name part (numbers, ASCII letters, dashes) & a dot
)* # end non-capturing group, match as often as possible
( # begin group 1 (this will be the domain name)
(?: # start non-capturing group
[a-z0-9-]+\. # a name part, same as above
) # end non-capturing group
[a-z]+ # the TLD
) # end group 1
http://rubular.com/r/g6s9bQpNnC
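As a rough sketch, the pattern could be applied in PHP like this. The function name root_domain is hypothetical, and the ~ delimiter and the i (case-insensitive) flag are choices made here, not part of the pattern above.

```php
<?php
// Hypothetical helper: returns capture group 1 of the pattern above
// (the last two dot-separated labels), or false when nothing matches.
function root_domain(string $url)
{
    // "~" as the delimiter avoids escaping the slashes in "://".
    $pattern = '~^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)~i';

    if (preg_match($pattern, trim($url), $m) !== 1) {
        return false;
    }
    return strtolower($m[1]);
}
```

With this, localhost/noway and "i am so awesome" return false, but 101.121.44.xxx still matches (group 1 is 44.xxx), so IP-like input would need a separate check.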
Upvotes: 15
Reputation: 3730
^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$
Edit: in PHP that would be preg_match('~^(([a-zA-Z]+(\.[a-zA-Z]+)+)|([0-9]{1,3}(\.[0-9]{1,3}){3}))/.*$~', $myUrl, $matches), where $myUrl is a single URL string.
What you need will be in $matches[1].
Upvotes: 0