transient_panda
transient_panda

Reputation: 235

How to get possibly overlapping matches in a string

I'm looking for a way, either in Ruby or Javascript, that will give me all matches, possibly overlapping, within a string against a regexp.


Let's say I have str = "abcadc", and I want to find occurrences of a followed by any number of characters, followed by c. The result I'm looking for is ["abc", "adc", "abcadc"]. Any ideas on how I can accomplish this?

str.scan(/a.*c/) will give me ["abcadc"], str.scan(/(?=(a.*c))/).flatten will give me ["abcadc", "adc"].

Upvotes: 22

Views: 3235

Answers (8)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

In JS:

function extract_ov_matches(r, s) {
  var res = [], cur;
  r = RegExp('^(?:' + r.source + ')$', r.toString().replace(/^[\s\S]*\/(\w*)$/, '$1').replace('g',''));
  for (var q = 0; q < s.length; ++q)
    for (var w = q; w <= s.length; ++w)
      if (r.test(cur = s.substring(q, w)))
        res.push(cur);
  return res;
}
document.body.innerHTML += "<pre>" + JSON.stringify(extract_ov_matches( /a.*c/g, 'abcadc' ), 0, 4) + "</pre>";

The point here is that you need to create all possible permutations of the input string and then return those that FULLY match the supplied pattern.

Overview of the extract_ov_matches function

  • r is the supplied regex (a compiled regex object, with flags)
  • s is the input string
  • RegExp('^(?:' + r.source + ')$', r.toString().replace(/^[\s\S]*\/(\w*)$/, '$1').replace('g','')); recreates the regex supplied with anchors (^ for start of string and $ for end of string) to match the whole string and the g flag is removed (because the regex will be used with RegExp#test)
  • for (var q = 0; q < s.length; ++q) for (var w = q; w <= s.length; ++w) are used to create all input string permutations
  • if (r.test(cur = s.substring(q, w))) res.push(cur);: if the current permutation fully matches the pattern, it is added to res, that will get returned eventually.

Upvotes: 8

Patrick Roberts
Patrick Roberts

Reputation: 51916

This JavaScript approach offers an advantage over Wiktor's answer by lazily iterating the substrings of a given string using a generator function, which allows you to consume a single match at a time for very large input strings using a for...of loop, rather than generating a whole array of matches at once, which could lead to out-of-memory exceptions since the amount of substrings for a string grows quadratically with length:

function * substrings (str) {
  for (let length = 1; length <= str.length; length++) {
    for (let i = 0; i <= str.length - length; i++) {
      yield str.slice(i, i + length);
    }
  }
}

function * matchSubstrings (str, re) {
  const subre = new RegExp(`^${re.source}$`, re.flags);
  
  for (const substr of substrings(str)) {
    if (subre.test(substr)) yield substr;
  }
}

for (const match of matchSubstrings('abcabc', /a.*c/)) {
  console.log(match);
}

Upvotes: 1

Cary Swoveland
Cary Swoveland

Reputation: 110725

Here's an approach that is similar to @ndn's and @Mark's that works with any string and regex. I've implemented this as a method of String because that's where I'd like to see it. Wouldn't it be a great compliment to String#[] and String#scan?

class String
  def all_matches(regex)
    return [] if empty?
    r = /^#{regex}$/
    1.upto(size).with_object([]) { |i,a|
      a.concat(each_char.each_cons(i).map(&:join).select { |s| s =~ r }) }
  end
end

'abcadc'.all_matches /a.*c/
  # => ["abc", "abcadc", "adc"]
'aaabaaa'.all_matches(/a.*a/)
  #=> ["aa", "aa", "aa", "aa", "aaa", "aba", "aaa", "aaba", "abaa", "aaaba",
  #    "aabaa", "abaaa", "aaabaa", "aabaaa", "aaabaaa"] 

Upvotes: 5

guest271314
guest271314

Reputation: 1

Approach of RegExp /(a.c)|(a.*c)/g is to match "a" character followed by any character followed by "c" ; "a.*c" is to match "a" followed by any character followed by preceding character followed by "c" character ; note RegExp at (a.*c) could probably be improved. Condition at if checks if last character in input string is "c" , if true , push full input string to res results array

var str = "abcadc"
, res = str.match(/(a.c)|(a.*c)/g); 
if (str[str.length - 1] === "c") res.push(str);

document.body.textContent = res.join(" ")

Upvotes: 4

aef
aef

Reputation: 4698

In Ruby you could achieve the expected result using:

str = "abcadc"
[/(a[^c]*c)/, /(a.*c)/].flat_map{ |pattern| str.scan(pattern) }.reduce(:+)
# => ["abc", "adc", "abcadc"]

Whether this way works for you is highly dependent on what you really want to achieve.

I tried to put this into a single expression but I couldn't make it work. I would really like to know if there is some scientific reason this cannot be parsed by regular expressions or if I just don't know enough about Ruby's parser Oniguruma to do it.

Upvotes: 11

Aleksei Matiushkin
Aleksei Matiushkin

Reputation: 121000

▶ str = "abcadc"
▶ from = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'a' }.compact
▶ to   = str.split(/(?=\p{L})/).map.with_index { |c, i| i if c == 'c' }.compact
▶ from.product(to).select { |f,t| f < t }.map { |f,t| str[f..t] }
#⇒ [
#  [0] "abc",
#  [1] "abcadc",
#  [2] "adc"
# ]

I believe, that there is a fancy way to find all indices of a character in a string, but I was unable to find it :( Any ideas?

Splitting on “unicode char boundary” makes it to work with strings like 'ábĉ' or 'Üve Østergaard'.

For more generic solution, that accepts any “from” and “to” sequences, one should introduce just a little modification: find all indices of “from” and “to” in the string.

Upvotes: 5

Mark Reed
Mark Reed

Reputation: 95315

You want all possible matches, including overlapping ones. As you've noted, the lookahead trick from "How to find overlapping matches with a regexp?" doesn't work for your case.

The only thing I can think of that will work in the general case is to generate all of the possible substrings of the string and check each one against an anchored version of the regex. This is brute-force, but it works.

Ruby:

def all_matches(str, regex)
  (n = str.length).times.reduce([]) do |subs, i|
     subs += [*i..n].map { |j| str[i,j-i] }
  end.uniq.grep /^#{regex}$/
end

all_matches("abcadc", /a.*c/) 
#=> ["abc", "abcadc", "adc"]

Javascript:

function allMatches(str, regex) {
  var i, j, len = str.length, subs={};
  var anchored = new RegExp('^' + regex.source + '$');
  for (i=0; i<len; ++i) {
    for (j=i; j<=len; ++j) {
       subs[str.slice(i,j)] = true;
    }
  }
  return Object.keys(subs).filter(function(s) { return s.match(anchored); });
}

Upvotes: 8

ndnenkov
ndnenkov

Reputation: 36110

def matching_substrings(string, regex)
  string.size.times.each_with_object([]) do |start_index, maching_substrings|
    start_index.upto(string.size.pred) do |end_index|
      substring = string[start_index..end_index]
      maching_substrings.push(substring) if substring =~ /^#{regex}$/
    end
  end
end

matching_substrings('abcadc', /a.*c/) # => ["abc", "abcadc", "adc"]
matching_substrings('foobarfoo', /(\w+).*\1/) 
  # => ["foobarf",
  #     "foobarfo",
  #     "foobarfoo",
  #     "oo",
  #     "oobarfo",
  #     "oobarfoo",
  #     "obarfo",
  #     "obarfoo",
  #     "oo"]
matching_substrings('why is this downvoted?', /why.*/)
  # => ["why",
  #     "why ",
  #     "why i",
  #     "why is",
  #     "why is ",
  #     "why is t",
  #     "why is th",
  #     "why is thi",
  #     "why is this",
  #     "why is this ",
  #     "why is this d",
  #     "why is this do",
  #     "why is this dow",
  #     "why is this down",
  #     "why is this downv",
  #     "why is this downvo",
  #     "why is this downvot",
  #     "why is this downvote",
  #     "why is this downvoted",
  #     "why is this downvoted?"]

Upvotes: 11

Related Questions