Reputation: 1079
I am new to regex and am studying it on regularexperssion.com. What is the use of a colon (:) in regular expressions?
For example:
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';
which matches:
$url1 = "http://www.somewebsite.com";
$url2 = "https://www.somewebsite.com";
$url3 = "https://somewebsite.com";
$url4 = "www.somewebsite.com";
$url5 = "somewebsite.com";
Yeah, any help would be greatly appreciated.
Upvotes: 57
Views: 119872
Reputation: 191779
I've decided to go you one better and explain the entire regex:
^ # anchor to start of line
( # start grouping
( # start grouping
[\w]+ # at least one of 0-9a-zA-Z_
: # a literal colon
) # end grouping
? # this grouping is optional
\/\/ # two literal slashes
) # end capture
? # this grouping is optional
(
(
[\d\w] # exactly one of 0-9a-zA-Z_
# having \d is redundant
| # alternation
% # literal % sign
[a-fA-f\d]{2,2} # exactly 2 hexadecimal digits
# should probably be A-F
# using {2} would have sufficed
)+ # at least one of these groups
( # start grouping
: # literal colon
(
[\d\w]
|
%
[a-fA-f\d]{2,2}
)+
)? # Same grouping, but it is optional
# and there can be only one
@ # literal @ sign
)? # this group is optional
(
[\d\w] # same as [\w], explained above
[-\d\w]{0,253} # includes a dash (-) as a valid character
# between 0 and 253 of these characters
[\d\w] # end with \w. They want at most 255
# total and - cannot be at the start
# or end
\. # literal period
)+ # at least one of these groups
[\w]{2,4} # two to four \w characters
(
: # literal colon
[\d]+ # at least one digit
)?
(
\/ # literal slash
(
[-+_~.\d\w] # one of these characters
| # *or*
% # % with two hex digit combo
[a-fA-f\d]{2,2}
)* # zero or more of these groups
)* # zero or more of these groups
(
\? # literal question mark
(
&? # optional literal &
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)
=? # optional literal =
)* # zero or more of this group
)? # this group is optional
(
# # literal #
(
[-+_~.\d\w]
|
%
[a-fA-f\d]{2,2}
)*
)?
$ # anchor to end of line
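To see where the colons in that breakdown actually land, the pattern can be run as-is. This is a minimal PHP sketch, not a definitive test: the port URL (`http://www.somewebsite.com:8080`) is an extra input I made up for illustration, and the group numbers are simply counted from the opening parentheses of the question's pattern.

```php
<?php
// Pattern copied verbatim from the question.
$pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/';

// The five URLs from the question.
$urls = [
    'http://www.somewebsite.com',
    'https://www.somewebsite.com',
    'https://somewebsite.com',
    'www.somewebsite.com',
    'somewebsite.com',
];

// Record whether each URL matches (preg_match returns 1 or 0).
$results = [];
foreach ($urls as $url) {
    $results[$url] = preg_match($pattern, $url);
    echo $url, ' => ', $results[$url] ? 'match' : 'no match', PHP_EOL;
}

// Hypothetical extra input (not in the question) that exercises the port colon.
preg_match($pattern, 'http://www.somewebsite.com:8080', $m);
echo 'group 2 (scheme and its colon): ', $m[2], PHP_EOL;
echo 'group 8 (colon and port):       ', $m[8], PHP_EOL;
```

In both places the colon is matched as an ordinary character; nothing about it changes how the surrounding groups behave.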
It's important to understand what the metacharacters/sequences are. Some sequences are not meta when used in certain contexts (especially a character class). I've cataloged them for you:
^ -- zero-width start of line
() -- grouping/capture
? -- zero or one of the preceding sequence
+ -- one or more of the preceding sequence
* -- zero or more of the preceding sequence
[] -- character class
\w -- alphanumeric characters and _. Opposite of \W
| -- alternation
{} -- repetition count (quantifier)
$ -- zero-width end of line
This excludes :, @, and % from having any special/meta meaning in the raw context.
Inside a character class, ] ends the class, and - creates a range of characters unless it is at the start or the end of the class or is escaped with a backslash.
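A quick way to convince yourself of both points (the literal :, @, % and the position-dependent dash) is a small PHP sketch like the one below. The sample strings are made up for illustration.

```php
<?php
// Outside a character class, ':', '@' and '%' simply match themselves.
$literal = preg_match('/^:@%$/', ':@%');

// '-' between two characters forms a range: [a-c] is a, b or c.
$range = preg_match_all('/[a-c]/', 'a-b-c', $rangeHits);

// '-' at the start of the class is a literal dash: [-ac] is '-', a or c.
$dash = preg_match_all('/[-ac]/', 'a-b-c', $dashHits);

echo 'range hits: ', implode(' ', $rangeHits[0]), PHP_EOL;
echo 'dash hits:  ', implode(' ', $dashHits[0]), PHP_EOL;
```

The range version skips the dashes in the input, while the leading-dash version picks them up and no longer matches b.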
A (? combination starts a special grouping construct. For example, (?: means group but do not capture. In the regex /(?:a)/, the pattern matches the string "a", but a is not captured for use in replacement or match groups as it would be with /(a)/.
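The effect on replacements is easy to show in PHP. This is only a sketch; the two-group patterns /(a)(b)/ and /(?:a)(b)/ are my own variations on the /(a)/ example, chosen so that the renumbering is visible.

```php
<?php
// With two capturing groups, $1 refers to "a".
$captured = preg_replace('/(a)(b)/', '[$1]', 'ab');

// With (?:a) the first group gets no number, so $1 now refers to "b".
$uncaptured = preg_replace('/(?:a)(b)/', '[$1]', 'ab');

// Both single-group patterns match "a"; only the recorded groups differ.
preg_match('/(?:a)/', 'a', $m1); // whole match only
preg_match('/(a)/', 'a', $m2);   // whole match plus group 1

echo $captured, ' ', $uncaptured, ' ', count($m1), ' ', count($m2), PHP_EOL;
```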
? is also used for lookahead/lookbehind assertions: (?=, (?!, (?<=, and (?<!. A (? followed by any other sequence is not a literal ?: PCRE either treats it as another special construct (for example, inline modifiers such as (?i)) or rejects the pattern as invalid.
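For the four lookaround forms, here is a small PHP sketch with one example each. The sample strings are invented; the point is only that the text inside the assertion is checked but never becomes part of the match.

```php
<?php
// (?=...) positive lookahead: digits only if "px" follows; "px" is not consumed.
preg_match('/\d+(?=px)/', '30em 12px', $ahead);

// (?!...) negative lookahead: "foo" that is not followed by "bar".
preg_match('/foo(?!bar)/', 'foobar foobaz', $notAhead, PREG_OFFSET_CAPTURE);

// (?<=...) positive lookbehind: digits that are preceded by "#".
preg_match('/(?<=#)\d+/', 'issue #42', $behind);

// (?<!...) negative lookbehind: "bar" that is not preceded by "foo".
preg_match('/(?<!foo)bar/', 'foobar xbar', $notBehind, PREG_OFFSET_CAPTURE);

echo $ahead[0], ' ', $behind[0], PHP_EOL;
echo 'foo at offset ', $notAhead[0][1], ', bar at offset ', $notBehind[0][1], PHP_EOL;
```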
Upvotes: 49
Reputation: 64603
Colon : is simply a colon. It has no special meaning except in a few constructs, for example clustering without capturing (also known as a non-capturing group):
(?:pattern)
It also appears in POSIX character classes, for example:
[[:upper:]]
However, in your case colon is just a colon.
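If you want to try the [[:upper:]] form yourself, something like the following PHP sketch should do. The input strings are just placeholders.

```php
<?php
// [:upper:] is a POSIX class name; it is only recognised inside an outer [...].
preg_match_all('/[[:upper:]]/', 'Hello World', $upper);

// Outside that construct, the colon is an ordinary character.
$plain = preg_match('/^\w+:$/', 'mailto:');

echo implode(',', $upper[0]), ' ', $plain, PHP_EOL;
```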
Special characters used in your regex:
In the character class [-+_~.\d\w]:
- means -
+ means +
_ means _
~ means ~
. means .
\d means any digit
\w means any word character
These symbols have this meaning because they are used inside a character class []. Outside a character class, + and . have special meaning.
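The difference between . and + inside and outside a class can be checked with a few one-liners. A minimal PHP sketch, using made-up test strings:

```php
<?php
// Outside a class, '.' matches any character (except a newline by default).
$anyChar = preg_match('/^a.c$/', 'abc');

// Inside a class, '.' matches only a literal dot.
$dotMiss = preg_match('/^a[.]c$/', 'abc');
$dotHit  = preg_match('/^a[.]c$/', 'a.c');

// Outside a class, '+' repeats the previous item; inside, it is a literal plus.
$repeat = preg_match('/^a+$/', 'aaa');
$plus   = preg_match('/^a[+]$/', 'a+');

echo "$anyChar $dotMiss $dotHit $repeat $plus", PHP_EOL;
```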
Other elements:
=? means = that can occur 0 or 1 times; in other words, an optional =.
Upvotes: 86
Reputation: 280
A colon has no special meaning in regular expressions; it just matches a literal colon.
[\w]+:
This just means any word character, one or more times, followed by a literal colon.
The brackets are actually not needed here. Square brackets define a set of characters to match. So
[abcd]
means a single character that is one of a, b, c, or d.
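Both claims are easy to verify. Here is a rough PHP sketch; `https://example.com` is just a placeholder input I picked, not something from the question.

```php
<?php
// [\w]+: and \w+: find exactly the same text.
preg_match('/[\w]+:/', 'https://example.com', $withBrackets);
preg_match('/\w+:/', 'https://example.com', $withoutBrackets);

// [abcd] stands for exactly one character out of the set.
$single = preg_match('/^[abcd]$/', 'c');
$double = preg_match('/^[abcd]$/', 'ab');

echo $withBrackets[0], ' ', $withoutBrackets[0], ' ', $single, ' ', $double, PHP_EOL;
```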
Upvotes: 1
Reputation: 68810
There is no special use for the colon : in your case:
(([\w]+:)?\/\/)?
will match http://, https://, ftp://, ...
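Here is that fragment on its own, as a small PHP sketch. The `^` anchor and the sample inputs (including the scheme-less `//a` and the bare `a`) are my additions for illustration.

```php
<?php
// The scheme fragment from the question, anchored at the start of the string.
$prefix = '/^(([\w]+:)?\/\/)?/';

$inputs = ['http://a', 'https://a', 'ftp://a', '//a', 'a'];
$found = [];
foreach ($inputs as $input) {
    // The whole fragment is optional, so preg_match always succeeds here;
    // $m[0] holds whatever prefix was consumed (possibly an empty string).
    preg_match($prefix, $input, $m);
    $found[$input] = $m[0];
    echo str_pad($input, 10), ' => "', $m[0], '"', PHP_EOL;
}
```

Because both the scheme and the slashes are optional, the fragment also accepts a bare `//` or nothing at all.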
There is one special use for the colon: every group starting with (?: is non-capturing and won't appear in the results.
Example, with "foobarbaz" as input:
/foo((bar)(baz))/
=> { [1] => 'barbaz', [2] => 'bar', [3] => 'baz' }
/foo(?:(bar)(baz))/
=> { [1] => 'bar', [2] => 'baz' }
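The same comparison as runnable PHP, in case you want to reproduce it. Note that preg_match also stores the whole match at index 0, which the listing above leaves out.

```php
<?php
// Outer group captures "barbaz", inner groups capture "bar" and "baz".
preg_match('/foo((bar)(baz))/', 'foobarbaz', $capturing);

// With (?: the outer group is not recorded, so the inner groups move up.
preg_match('/foo(?:(bar)(baz))/', 'foobarbaz', $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```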
Upvotes: 7