Reputation: 2369
I'm trying to find a regex that matches URLs with three or more repeating segments (and may include any number of directories) such as:
s1 = 'http://www.foo.com/bar/bar/bar/'
s2 = 'http://www.foo.com/baz/biz/baz/biz/baz/biz/etc'
s3 = '/foo/bar/foo/bar/foo/bar/'
and not match URLs like:
s4 = '/foo/bar/foo/bar/foo/barbaz'
First I tried:
re1 = /((.+\/)+)\1\1/
which works:
re1 === s1 #=> true
re1 === s2 #=> true
but as the number of segments grows, the regex match takes exponentially longer:
require 'benchmark'
Benchmark.bm do |b|
(10..15).each do |num|
str = '/foo/bar' * num
puts str
b.report("#{num} repeats:") { /((.+\/)+)\1\1/ === str }
end
end
user system total real
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
10 repeats: 0.060000 0.000000 0.060000 ( 0.054839)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
11 repeats: 0.210000 0.000000 0.210000 ( 0.213492)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
12 repeats: 0.870000 0.000000 0.870000 ( 0.871879)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
13 repeats: 3.370000 0.010000 3.380000 ( 3.399224)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
14 repeats: 13.580000 0.110000 13.690000 ( 13.790675)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
15 repeats: 54.090000 0.210000 54.300000 ( 54.562672)
Then, I tried a regex similar to one given here:
re2 = /(\/.+)(?=.*\1)\1\1/
which doesn't have performance issues, and matches the strings I'd like to to match:
re2 === s3 #=> true
but also matches strings I don't want it to match, such as:
re2 === s4 #=> true, but should be false
I'm close with the second regex. What am I missing?
Upvotes: 3
Views: 171
Reputation: 110675
Definitions
Suppose:
str = 'http://www.example.com/dog/baz/biz/baz/biz/baz/biz/cat/'
We can define '/dog'
, '/baz'
, '/biz'
and so on as segments. A group is comprised of one or more contiguous segments, such as '/dog'
, '/baz'
, '/dog/baz'
, '/baz'
, '/baz/biz'
, 'biz/baz'
, '/baz/biz/baz'
and so on.
The Problem
It is my understanding that the problem is to determine if a given string contains three (or more) contiguous and equal groups followed by a forward slash. s2
meets this test with the following substring:
'/baz/biz/baz/biz/baz/biz/'
Algorithm
I do not believe a single regular expression can be fashioned to make this determination, but we can write a regular expression to determine if at least three (or some arbitrary number) of contiguous, equal groups are present, given the number of segments per group. Suppose this is done by a method named contiguous_fixed_group_size?
, which, is called as follows:
contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)
and returns true
or false
. To ensure that the string has at least 3 contiguous, equal groups (for a given value of segments_per_group
), we call this method with nbr_groups = 3
. I think it best to briefly postpone the construction of this method; for the present just assume it is available to us.
The approach I've taken is to call this method with different values of segments_per_group
and determine if the method returns true
for at least one of those values.
Main method
The first step is to determine the number of segments in the string (where str
holds the string given above):
r = /(?<!\/)\/(?!\/)/
nbr_segments = str.scan(r).size - 1
#=> 8
We may document this regular expression by writing it in free-spacing mode:
r = /
(?<!\/) # match is not to be preceded by '/' (negative lookbehind)
\/ # march '/'
(?!\/) # match is not to be followed by '/' (negative lookahead)
/x
The lookarounds prevent '//'
in str
from being matched.
We now ask ourselves what is the maximum value of segments_per_group
that we must consider. As we require that:
nbr_groups * segments_per_group <= nbr_segments
it follows that:
segments_per_group <= nbr_segments/nbr_groups
where integer arithmetic is employed on the right. For nbr_groups = 3
, we obtain:
segments_per_group <= 8/3 => 2
We therefore may determine if str
contains (at least) nbr_groups
contiguous, equal groups as follows:
(1..nbr_segments/nbr_groups).any? do |segs_per_group|
contiguous_fixed_group_size?(str, segs_per_group, nbr_groups)
end
#=> true
We may wrap this in a method:
def contiguous?(str, nbr_groups)
nbr_segments = str.scan(/(?<!\/)\/(?!\/)/).size - 1
(1..nbr_segments/nbr_groups).any? do |segs_per_grp|
contiguous_fixed_group_size?(str, segs_per_grp, nbr_groups)
end
end
Constructing the method contiguous_fixed_group_size?
This method may be written as follows:
def contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)
r = /((?:\/[^\/]+){#{segments_per_group}})\1{#{nbr_groups-1}}/
str.match?(r)
end
For
str = s2
segments_per_group = 2
nbr_groups = 3
the regular expression is:
r #=> /((?:\/[^\/]+){2})\1{2}\//
Here it is written in free-spacing mode:
r = /
(?<!\/) # match is not to be preceded by a forward slash
# (negative lookbehind)
( # begin capture group 1
(?: # begin non-capture group
\/[^\/]+ # match '/' followed by 1+ char other than '/'
) # end non-capture group
{#{segments_per_group}} # execute non-capture group segments_per_group times
) # end capture group 1
\1{#{nbr_groups-1}} # execute contents of capture group 1
# nbr_groups-1 times
\/ # match '/'
/x # free-spacing regex definition mode
Examples
str
as defined above.
contiguous?(str, 3) #=> true
contiguous?(str, 2) #=> true
contiguous?(str, 1) #=> true
contiguous?(str, 4) #=> false
str = 'http://www.example.com/dog/baz/biz/baz/bix/baz/biz/cat/'
contiguous?(str, 3) #=> false
contiguous?(str, 2) #=> false
contiguous?(str, 1) #=> true
Upvotes: 0
Reputation: 4132
Change .
to [^\/]
. This should decrease the complexity of the regex since it won't be trying to match "any" character.
require 'benchmark'
Benchmark.bm do |b|
(10..15).each do |num|
str = '/foo/bar' * num
puts str
b.report("#{num} repeats:") { /(([^\/]+\/)+)\1\1/ === str }
end
end
10 repeats: 0.000000 0.000000 0.000000 ( 0.000015)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
11 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
12 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
13 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
14 repeats: 0.000000 0.000000 0.000000 ( 0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
15 repeats: 0.000000 0.000000 0.000000 ( 0.000005)
Upvotes: 2