Tim
Tim

Reputation: 6209

Given a string, how to iterate through many regexes to find a match?

Given a string, and a mapping of regexes to integers, I want to find out which integer the string maps to (assume the string will match exactly one of those regexes)

Is it inefficient to just iterate through the hash of regexes, trying each regex against the string, then outputting the value? Certainly I cannot explicitly enumerate all possible string => integer mappings, but it seems bad to try matching each regex in bunch of regexes.

Upvotes: 1

Views: 1056

Answers (4)

the Tin Man
the Tin Man

Reputation: 160571

RegexpTrie, which wasn't around when I last looked for something like it, helps with this sort of problem:

require 'regexp_trie'

sentence = 'life on the mississippi'
words_ary = %w[the sip life]

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i 
# => /\b(?:(?:the|sip|life))\b/i

words_to_ints = words_ary.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}

sentence_words = sentence.split
# => ["life", "on", "the", "mississippi"]

word_hits = sentence_words.map { |w| w[words_regex] }
# => ["life", nil, "the", nil]

nil means there was no match of that word in the regular expression.

words_to_ints.values_at(*word_hits)
# => [2, nil, 0, nil]

Again, nil means there was no match. nil values could be ignored using:

word_hits = sentence_words.map { |w| w[words_regex] }.compact
# => ["life", "the"]

words_to_ints.values_at(*word_hits)
# => [2, 0]

Similarly, if you want to scan a sentence for word matches instead of individual words:

require 'regexp_trie'

sentence = 'life on the mississippi'
words = %w[the sip life]

words_regex = /\b(?:#{RegexpTrie.union(words, option: Regexp::IGNORECASE).source})\b/i 
# => /\b(?:(?:the|sip|life))\b/i

words_to_ints = words.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}

word_hits = sentence.scan(words_regex)
# => ["life", "the"]

words_to_ints.values_at(*word_hits)
# => [2, 0]

Perl has a really useful module for this sort of thing called Regexp::Assemble, which lets you combine regexes into one big one, then search a string, returning the hits. You can ask it to tell which pattern was used if you want to know.

Ruby doesn't have such a module, but this gets kinda close:

patterns = {
  /(foo)/ => 1,
  /(bar)/ => 2
}

pattern_union = Regexp.union(patterns.keys)

pattern_union # => /(?-mix:(foo))|(?-mix:(bar))/

str = 'foo some text'

if (pattern_union =~ str)

  # these show what are being processed...
  pattern_union.match(str).captures # => ["foo", nil]
  pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] } # => [/(foo)/]

  # process it...
  matched_pattern_values = patterns.values_at(*pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] })

  # here's what we got
  matched_pattern_values # => [1]

end

There's probably a way to do it in one line, but this works.

I think its important to avoid having to iterate over patterns to look for hits in strings if at all possible, because they can slow down badly as the size of the text or number of patterns increase.

See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.

Upvotes: 1

Confusion
Confusion

Reputation: 16851

You say 'it seems bad', but in the end, there is probably nothing you can do: you will have to match each string to a succession of regexes until one matches. You can memoize the results and be smart in other ways, like 'if this regex fails, these other 10 will also fail', but those are all performance optimizations that you may not need.

The simplest optimization may be to create groups of regexes with shared characteristics and first test which group the string is in. If string =~ /^a/ is nil, all other regexes that test for a string starting with a don't need to be tested anymore.

Upvotes: -1

Theo
Theo

Reputation: 132902

Just do as you suggest, loop over the hash of regex/numbers and return the first that matches a string:

def find_match(regex_mapping, str)
  regex_mapping.each do |regex, n|
    return n if str =~ regex
  end
  return nil
end

The only thing to say about efficiency is this: it probably doesn't matter anyway. Just write your code as clearly and simply as you can, and then, in the end, if it is to slow, run it through a profiler (for example the absolutely awesome perftools.rb) and see what the hotspots are. Optimize those. Don't optimize before you've written any code.

That said, a simple optimization you can do in this case, which doesn't cost you anything, is to put the regexes into the mapping hash in an order such that the most likely to match comes first, that way fewer comparisons have to be made (but this is a probabilistic optimization, the worst case running time remains the same). This only applies to Ruby 1.9 though, since hashes don't retain their insertion order in 1.8.

Upvotes: 1

haroldcampbell
haroldcampbell

Reputation: 1542

It depends on how complicated your regexes are. If you can put them in capture blocks and have the capture blocks map back to the integers that you need then you should be ok.

For instance:

(is).*(test)

has two capture blocks. This will match:

This is a test

The captures will be 1:is and 2:test

You can quickly try it on http://www.rubular.com/

Upvotes: 0

Related Questions