MothOnMars
MothOnMars

Reputation: 2369

Regular expression to match a string with a repeating pattern

I'm trying to find a regex that matches URLs with three or more repeating segments (and may include any number of directories) such as:

and not match URLs like:

First I tried:

re1 = /((.+\/)+)\1\1/

which works:

re1 === s1 #=> true
re1 === s2 #=> true

but as the number of segments grows, the regex match takes exponentially longer:

require 'benchmark'
Benchmark.bm do |b|
  (10..15).each do |num|
    str = '/foo/bar' * num
    puts str
    b.report("#{num} repeats:") { /((.+\/)+)\1\1/ === str }
  end
end

       user     system      total        real
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    10 repeats:  0.060000   0.000000   0.060000 (  0.054839)
    /foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    11 repeats:  0.210000   0.000000   0.210000 (  0.213492)
    /foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    12 repeats:  0.870000   0.000000   0.870000 (  0.871879)
    /foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    13 repeats:  3.370000   0.010000   3.380000 (  3.399224)
    /foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    14 repeats: 13.580000   0.110000  13.690000 ( 13.790675)
    /foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
    15 repeats: 54.090000   0.210000  54.300000 ( 54.562672)

Then, I tried a regex similar to one given here:

re2 = /(\/.+)(?=.*\1)\1\1/

which doesn't have performance issues, and matches the strings I'd like to to match:

re2 === s3 #=> true

but also matches strings I don't want it to match, such as:

re2 === s4 #=> true, but should be false

I'm close with the second regex. What am I missing?

Upvotes: 3

Views: 171

Answers (2)

Cary Swoveland
Cary Swoveland

Reputation: 110675

Definitions

Suppose:

str = 'http://www.example.com/dog/baz/biz/baz/biz/baz/biz/cat/'

We can define '/dog', '/baz', '/biz' and so on as segments. A group is comprised of one or more contiguous segments, such as '/dog', '/baz', '/dog/baz', '/baz', '/baz/biz', 'biz/baz', '/baz/biz/baz' and so on.

The Problem

It is my understanding that the problem is to determine if a given string contains three (or more) contiguous and equal groups followed by a forward slash. s2 meets this test with the following substring:

'/baz/biz/baz/biz/baz/biz/'

Algorithm

I do not believe a single regular expression can be fashioned to make this determination, but we can write a regular expression to determine if at least three (or some arbitrary number) of contiguous, equal groups are present, given the number of segments per group. Suppose this is done by a method named contiguous_fixed_group_size?, which, is called as follows:

contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)

and returns true or false. To ensure that the string has at least 3 contiguous, equal groups (for a given value of segments_per_group), we call this method with nbr_groups = 3. I think it best to briefly postpone the construction of this method; for the present just assume it is available to us.

The approach I've taken is to call this method with different values of segments_per_group and determine if the method returns true for at least one of those values.

Main method

The first step is to determine the number of segments in the string (where str holds the string given above):

 r = /(?<!\/)\/(?!\/)/
 nbr_segments = str.scan(r).size - 1 
   #=> 8

We may document this regular expression by writing it in free-spacing mode:

 r = /
     (?<!\/)  # match is not to be preceded by '/' (negative lookbehind)
     \/       # march '/' 
     (?!\/)   # match is not to be followed by '/' (negative lookahead)
     /x

The lookarounds prevent '//' in str from being matched.

We now ask ourselves what is the maximum value of segments_per_group that we must consider. As we require that:

nbr_groups * segments_per_group <= nbr_segments

it follows that:

segments_per_group <= nbr_segments/nbr_groups

where integer arithmetic is employed on the right. For nbr_groups = 3, we obtain:

segments_per_group <= 8/3 => 2

We therefore may determine if str contains (at least) nbr_groups contiguous, equal groups as follows:

(1..nbr_segments/nbr_groups).any? do |segs_per_group|
  contiguous_fixed_group_size?(str, segs_per_group, nbr_groups)
end
  #=> true

We may wrap this in a method:

def contiguous?(str, nbr_groups)
  nbr_segments = str.scan(/(?<!\/)\/(?!\/)/).size - 1
  (1..nbr_segments/nbr_groups).any? do |segs_per_grp|
    contiguous_fixed_group_size?(str, segs_per_grp, nbr_groups)
  end
end

Constructing the method contiguous_fixed_group_size?

This method may be written as follows:

def contiguous_fixed_group_size?(str, segments_per_group, nbr_groups)
  r = /((?:\/[^\/]+){#{segments_per_group}})\1{#{nbr_groups-1}}/ 
  str.match?(r)
end

For

str = s2
segments_per_group = 2
nbr_groups = 3

the regular expression is:

r #=> /((?:\/[^\/]+){2})\1{2}\//

Here it is written in free-spacing mode:

r = /
    (?<!\/)                    # match is not to be preceded by a forward slash
                               # (negative lookbehind)    
    (                          # begin capture group 1
      (?:                      # begin non-capture group
        \/[^\/]+               # match '/' followed by 1+ char other than '/'
      )                        # end non-capture group 
      {#{segments_per_group}}  # execute non-capture group segments_per_group times
    )                          # end capture group 1
    \1{#{nbr_groups-1}}        # execute contents of capture group 1
                               # nbr_groups-1 times 
    \/                         # match '/'
    /x                         # free-spacing regex definition mode

Examples

str as defined above.

contiguous?(str, 3) #=> true
contiguous?(str, 2) #=> true
contiguous?(str, 1) #=> true
contiguous?(str, 4) #=> false

str = 'http://www.example.com/dog/baz/biz/baz/bix/baz/biz/cat/'
contiguous?(str, 3) #=> false
contiguous?(str, 2) #=> false
contiguous?(str, 1) #=> true

Upvotes: 0

Josh Voigts
Josh Voigts

Reputation: 4132

Change . to [^\/]. This should decrease the complexity of the regex since it won't be trying to match "any" character.

require 'benchmark'

Benchmark.bm do |b|
  (10..15).each do |num|
    str = '/foo/bar' * num
    puts str
    b.report("#{num} repeats:") { /(([^\/]+\/)+)\1\1/ === str }
  end
end

10 repeats:  0.000000   0.000000   0.000000 (  0.000015)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
11 repeats:  0.000000   0.000000   0.000000 (  0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
12 repeats:  0.000000   0.000000   0.000000 (  0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
13 repeats:  0.000000   0.000000   0.000000 (  0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
14 repeats:  0.000000   0.000000   0.000000 (  0.000004)
/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar
15 repeats:  0.000000   0.000000   0.000000 (  0.000005)

Upvotes: 2

Related Questions