Reputation: 7643
I was reading this question about how to parse URLs out of web pages and had a question about the accepted answer which offered this solution:
((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)
The solution was offered by csmba and he credited it to regexlib.com. Whew. Credits done.
I think this is a fairly naive regular expression but it's a fine starting point for building something better. But, my question is this:
What is the point of {1}? It means "exactly one of the previous grouping", right? Isn't that the default behavior of a grouping in a regular expression? Would the expression be changed in any way if the {1} were removed?
If I saw this from a coworker I would point out his or her error but as I write this the response is rated at a 6 and the expression on regexlib.com is rated a 4 of 5. So maybe I'm missing something?
Upvotes: 4
Views: 953
Reputation: 58921
I don't think it has any purpose. But because RegEx is almost impossible to understand/decompose, people rarely point out errors. That is probably why no one else pointed it out.
Upvotes: 1
Reputation: 241770
@Jeff Atwood, your interpretation is a little off. The {1} means "match exactly once", but it has no effect on capturing. Capturing happens because of the parentheses; the braces only specify how many times the pattern must match the source: once, as you say.
I agree with @Marius, even if his answer is a little terse and may come off as flippant. Regular expressions are tough if one isn't used to them, and the {1} in the question isn't quite an error: in systems that support it, it does mean "exactly one match". Since a group already matches exactly once by default, it doesn't really do anything.
Unfortunately, contrary to a now-deleted post, it doesn't keep the regexp from matching http://http://example.org, since the \S+ at the end will match one or more non-whitespace characters, including the http://example.org in http://http://example.org (verified using Python 2.5, just in case my regexp reading was off). So, the regexp given isn't really the best. I'm not a URL expert, but probably something limiting the appearance of ":"s and "//"s after the first one would be necessary (but hardly sufficient) to ensure good URLs.
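A minimal sketch of that check, written for current Python rather than the 2.5 used originally. The helper name first_match and the sample sentence are illustrative assumptions, not part of the original test:

```python
import re

# The pattern from the question, and the same pattern with {1} removed.
WITH_QUANTIFIER = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)')
WITHOUT_QUANTIFIER = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://)\S+)')


def first_match(pattern, text):
    """Return group 1 of the first match in text, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


text = "Text this is http://http://example.org text this is"

# Both patterns accept the doubled scheme, because \S+ happily
# consumes the second "http://" along with the host name.
for pattern in (WITH_QUANTIFIER, WITHOUT_QUANTIFIER):
    print(first_match(pattern, text))
```

Both lines print http://http://example.org, so the {1} makes no difference to this input.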
Upvotes: 1
Reputation: 63949
I don't think the {1} has any valid function in that regex.
(mailto\:|(news|(ht|f)tp(s?))\://){1}
You should read this as: "capture the stuff in the parens exactly one time". But we don't really care about capturing this for use later, e.g. as $1 in a replacement. So it's pointless.
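One way to see that the {1} changes nothing about what gets captured is to compare every capture group with and without it. This is only a sketch in Python; the helper captured_groups and the sample strings (including the mailto address) are made up for illustration:

```python
import re

WITH_QUANTIFIER = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)')
WITHOUT_QUANTIFIER = re.compile(r'((mailto\:|(news|(ht|f)tp(s?))\://)\S+)')


def captured_groups(pattern, text):
    """Return the tuple of all capture groups for the first match, or None."""
    match = pattern.search(text)
    return match.groups() if match else None


# Illustrative inputs, not taken from the thread.
SAMPLES = [
    "see http://example.com for details",
    "write to mailto:someone@example.com today",
    "no link here",
]

for sample in SAMPLES:
    with_q = captured_groups(WITH_QUANTIFIER, sample)
    without_q = captured_groups(WITHOUT_QUANTIFIER, sample)
    # The parentheses do the capturing; {1} only repeats the group once.
    print(sample, "->", with_q == without_q)
```

Each sample prints True: the groups are identical whether or not the {1} is present.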
Upvotes: 2
Reputation: 7643
@Rob: I disagree. To enforce what you are asking for I think you would need to use negative look-behind, which is possible but is certainly not related to the use of {1}. Neither version of the regexp addresses that particular issue.
To let the code speak:
tibook 0 /home/jj33/swap > cat text
Text this is http://example.com text this is
Text this is http://http://example.com text this is
tibook 0 /home/jj33/swap > cat p
#!/usr/bin/perl
my $re1 = '((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)';
my $re2 = '((mailto\:|(news|(ht|f)tp(s?))\://)\S+)';
while (<>) {
print "Evaluating: $_";
print "re1 saw \$1 = $1\n" if (/$re1/);
print "re2 saw \$1 = $1\n" if (/$re2/);
}
tibook 0 /home/jj33/swap > cat text | perl p
Evaluating: Text this is http://example.com text this is
re1 saw $1 = http://example.com
re2 saw $1 = http://example.com
Evaluating: Text this is http://http://example.com text this is
re1 saw $1 = http://http://example.com
re2 saw $1 = http://http://example.com
tibook 0 /home/jj33/swap >
So, if there is a difference between the two versions, it doesn't seem to be the one you suggest.
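For comparison, here is a rough sketch of what rejecting the doubled scheme might look like with lookaround assertions. It is shown in Python, and neither regexp in this thread does this. The STRICTER pattern and the find_url helper are hypothetical, and the approach is deliberately crude:

```python
import re

# Hypothetical stricter variant (not from the thread):
#   (?<!\S)      the scheme must start the text or follow whitespace
#   (?!\S*\://)  no further "://" may appear in the rest of the token
STRICTER = re.compile(
    r'(?<!\S)((mailto\:|(news|(ht|f)tp(s?))\://)(?!\S*\://)\S+)'
)


def find_url(text):
    """Return the first URL-like token accepted by STRICTER, or None."""
    match = STRICTER.search(text)
    return match.group(1) if match else None


print(find_url("Text this is http://example.com text this is"))
print(find_url("Text this is http://http://example.com text this is"))
```

The first call prints http://example.com and the second prints None. The trade-off is that this also rejects legitimate URLs that contain "://" later on (for example inside a query string) and URLs that directly follow punctuation such as an opening parenthesis, so it is a starting point rather than a fix.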
Upvotes: 3