Reputation: 25924
I am trying to match parts of a URL. As a first step, I want to match only strings like these:
http://Stackoverflow.com/questions/blah/balh.blah
http://www.stackoverflow.com/questions/blah/balh.blah
stackoverflow.com/questions/blah/balh.blah
www.stackoverflow.com/
But I want to support other protocols such as https and ftp as well.
I wrote something like this myself, which is not good at all:
((http:\/\/|https:\/\/|ftp:\/\/)*)((www.)*)([a-z]+).([a-z]{2,3})(\/)*
There are lots of problems with this regex, and I need to figure out how to get it fixed.
First, how can I specify, for example, that only http or https is valid, and not htttp, hazzzzt, etc.?
To be more precise:
What is clear now is that (http) is not treated like a word; it is just a set of characters, so any word that has only one of those letters gets a match.
I have read about \b, which works as a word boundary, but it seems \bhttp\b doesn't actually mean "treat http as a single word rather than a set of characters"!
And the www part matches wwww and ww, or any other number of ws!
I am always getting a match no matter what I input!
I use http://regex101.com/ to test the regex.
Upvotes: 1
Views: 202
Reputation: 41838
Hossein, your post raises several points and questions.
A. How to include or exclude some specific patterns in regex?
There are many techniques. For simple patterns, you specify what you want, or you specify what you don't want, either with negated character classes or with negative lookarounds. For more intricate patterns, a great place to start is the question "Match (or replace) a pattern except in situations s1, s2, s3 etc".
B. How can a specific word be included or excluded?
In general, to make sure a specific word belongs or doesn't belong to a string, if you don't know its placement, you do a lookahead (or negative lookahead) at the beginning of the string:
^(?=.*?MyWord) # makes sure the word is there
or
^(?!.*?MyWord) # makes sure the word is not there
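As a rough illustration of the two lookaheads above, here is a minimal Python sketch. The helper names `contains_word` and `lacks_word` are made up for this example, and `MyWord` is just the placeholder from the patterns. Note that `.` does not cross newlines by default, so this only inspects the first line of the input.

```python
import re

# Patterns copied from the answer; "MyWord" is a placeholder word.
MUST_CONTAIN = re.compile(r"^(?=.*?MyWord)")
MUST_NOT_CONTAIN = re.compile(r"^(?!.*?MyWord)")


def contains_word(text: str) -> bool:
    """Hypothetical helper: True when MyWord occurs somewhere in text."""
    return MUST_CONTAIN.search(text) is not None


def lacks_word(text: str) -> bool:
    """Hypothetical helper: True when MyWord occurs nowhere in text."""
    return MUST_NOT_CONTAIN.search(text) is not None


if __name__ == "__main__":
    print(contains_word("here is MyWord"))  # True
    print(lacks_word("here is MyWord"))     # False
    print(lacks_word("nothing here"))       # True
```

Both lookaheads are zero-width: they only assert a condition at the start of the string, so the rest of a larger pattern can follow them unchanged.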
C. What is clear now, is that (http) is not treated like a word, it is just a class set of characters, so any word that has only one of those letters gets a match
That is not correct. (http) will only match http. It will not match ptth, for instance. Perhaps you are thinking of [http], which is a character class that matches exactly one character: h, t or p (and the duplicate t is redundant, since [pth] would do the same).
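The difference between the group and the character class can be checked quickly. This is a small Python sketch; the variable names are only illustrative.

```python
import re

# A group matches the literal sequence h-t-t-p, in that order.
GROUP = re.compile(r"(http)")
# A character class matches exactly ONE character out of h, t, p.
CHAR_CLASS = re.compile(r"[http]")

if __name__ == "__main__":
    print(GROUP.search("ptth"))                # None: the sequence "http" is absent
    print(GROUP.search("htttp"))               # None: three t's break the sequence
    print(GROUP.search("see http://x").group(1))  # http
    print(CHAR_CLASS.search("ptth").group())   # p: a single character from the set
    print(CHAR_CLASS.findall("hazzzzt"))       # ['h', 't']
```

So a grouped `(http)` already rejects `htttp` and `hazzzzt`; only the bracketed form behaves like a set of letters.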
D. How to Match the Parts of a URL
There are many solutions to this, but for today I'd suggest not reinventing the wheel. May I suggest the regex in the RegexBuddy library for this purpose? It is
(?i)\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
Here follows a token-by-token explanation (I added the case-insensitive (?i) modifier at the beginning).
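\b asserts a word boundary, so the match cannot start in the middle of a word.
((?#protocol)https?|ftp) is capturing group 1, the protocol. (?#protocol) is an inline comment and matches nothing. https? matches http followed by an optional s (the ? makes the s optional). Failing that, the alternative ftp matches the literal text ftp.
:// matches those three characters literally.
((?#domain)[-A-Z0-9.]+) is capturing group 2, the domain. After the (?#domain) comment, [-A-Z0-9.]+ matches one or more (+) characters, each of which is a hyphen, a letter A-Z, a digit 0-9, or a dot.
((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)? is capturing group 3, the file path. The trailing ? makes the whole group optional. After the (?#file) comment, / matches a literal slash, and the character class then matches zero or more (*) characters, each of which is a hyphen, A-Z, 0-9, or one of +&@#/%=~_|!:,.;
((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)? is capturing group 4, the query parameters. The trailing ? again makes the group optional. After the (?#parameters) comment, \? matches a literal question mark, and the character class then matches zero or more (*) characters, each of which is A-Z, 0-9, or one of +&@#/%=~_|!:,.;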
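To see the four capturing groups in action, here is a minimal Python sketch that applies the regex above unchanged (Python's `re` module accepts both `(?i)` and `(?#...)` comments). The function name `split_url` and the dictionary keys are assumptions made for this example, not part of the RegexBuddy library.

```python
import re
from typing import Optional

# The regex from the answer, split across lines only for readability.
URL_RE = re.compile(
    r"(?i)\b((?#protocol)https?|ftp)://"
    r"((?#domain)[-A-Z0-9.]+)"
    r"((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"
    r"((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?"
)


def split_url(text: str) -> Optional[dict]:
    """Hypothetical helper: return the parts of the first URL found, or None."""
    match = URL_RE.search(text)
    if match is None:
        return None
    protocol, domain, file_path, parameters = match.groups()
    return {
        "protocol": protocol,
        "domain": domain,
        "file": file_path,         # None when the optional group did not match
        "parameters": parameters,  # None when the optional group did not match
    }


if __name__ == "__main__":
    print(split_url("http://www.stackoverflow.com/questions/blah/balh.blah"))
    print(split_url("https://example.com/a?b=1"))
    print(split_url("htttp://example.com"))             # None
    print(split_url("stackoverflow.com/questions/blah"))  # None
```

Note that this regex requires a protocol, so the protocol-less examples from the question (such as stackoverflow.com/questions/blah/balh.blah) do not match. Misspelled protocols such as htttp are rejected as well.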
Upvotes: 2
Reputation: 494
I don't think you need the outer parentheses. For example, the pattern below matches http:// or www. (make sure you escape the period):
(http:\/\/|www\.)
Also, if you are using preg_match, there are slight differences from Apache .htaccess: for instance, you use a character such as # to mark the start and end of the pattern:
$regEx = '#(http:\/\/|www\.)#';
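The point about escaping the period is independent of PHP, so here is a small sketch in Python, which takes the pattern as a plain string and therefore needs no # delimiters. The variable names are illustrative only.

```python
import re

# Unescaped dot: "." matches ANY character after "www".
UNESCAPED = re.compile(r"(http://|www.)")
# Escaped dot: "\." matches only a literal period.
ESCAPED = re.compile(r"(http://|www\.)")

if __name__ == "__main__":
    print(UNESCAPED.search("wwwXexample").group(1))    # wwwX
    print(ESCAPED.search("wwwXexample"))               # None
    print(ESCAPED.search("www.example.com").group(1))  # www.
    print(ESCAPED.search("http://example.com").group(1))  # http://
```

This is also why the original attempt with `(www.)*` accepted inputs it should not have: the dot was a wildcard and the `*` allowed zero or more repetitions.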
Upvotes: 1
Reputation: 1099
Maybe you can use PHP's filter_var function instead?
if (filter_var($url, FILTER_VALIDATE_URL) !== false)
FILTER_VALIDATE_URL validates URLs according to RFC 2396.
http://www.php.net/manual/de/filter.filters.validate.php
Upvotes: 0