Reputation: 6209
Given a string, and a mapping of regexes to integers, I want to find out which integer the string maps to (assume the string will match exactly one of those regexes)
Is it inefficient to just iterate through the hash of regexes, trying each regex against the string, then outputting the value? Certainly I cannot explicitly enumerate all possible string => integer mappings, but it seems bad to try matching each regex in bunch of regexes.
Upvotes: 1
Views: 1056
Reputation: 160571
RegexpTrie, which wasn't around when I last looked for something like it, helps with this sort of problem:
require 'regexp_trie'
sentence = 'life on the mississippi'
words_ary = %w[the sip life]
words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:the|sip|life))\b/i
words_to_ints = words_ary.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}
sentence_words = sentence.split
# => ["life", "on", "the", "mississippi"]
word_hits = sentence_words.map { |w| w[words_regex] }
# => ["life", nil, "the", nil]
nil
means there was no match of that word in the regular expression.
words_to_ints.values_at(*word_hits)
# => [2, nil, 0, nil]
Again, nil
means there was no match. nil
values could be ignored using:
word_hits = sentence_words.map { |w| w[words_regex] }.compact
# => ["life", "the"]
words_to_ints.values_at(*word_hits)
# => [2, 0]
Similarly, if you want to scan a sentence for word matches instead of individual words:
require 'regexp_trie'
sentence = 'life on the mississippi'
words = %w[the sip life]
words_regex = /\b(?:#{RegexpTrie.union(words, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:the|sip|life))\b/i
words_to_ints = words.each_with_index.to_h
# => {"the"=>0, "sip"=>1, "life"=>2}
word_hits = sentence.scan(words_regex)
# => ["life", "the"]
words_to_ints.values_at(*word_hits)
# => [2, 0]
Perl has a really useful module for this sort of thing called Regexp::Assemble, which lets you combine regexes into one big one, then search a string, returning the hits. You can ask it to tell which pattern was used if you want to know.
Ruby doesn't have such a module, but this gets kinda close:
patterns = {
/(foo)/ => 1,
/(bar)/ => 2
}
pattern_union = Regexp.union(patterns.keys)
pattern_union # => /(?-mix:(foo))|(?-mix:(bar))/
str = 'foo some text'
if (pattern_union =~ str)
# these show what are being processed...
pattern_union.match(str).captures # => ["foo", nil]
pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] } # => [/(foo)/]
# process it...
matched_pattern_values = patterns.values_at(*pattern_union.match(str).captures.zip(patterns.keys).find_all{ |c| c[0] }.map{ |c| c[1] })
# here's what we got
matched_pattern_values # => [1]
end
There's probably a way to do it in one line, but this works.
I think its important to avoid having to iterate over patterns to look for hits in strings if at all possible, because they can slow down badly as the size of the text or number of patterns increase.
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more about using Regexp::Assemble from Ruby.
Upvotes: 1
Reputation: 16851
You say 'it seems bad', but in the end, there is probably nothing you can do: you will have to match each string to a succession of regexes until one matches. You can memoize the results and be smart in other ways, like 'if this regex fails, these other 10 will also fail', but those are all performance optimizations that you may not need.
The simplest optimization may be to create groups of regexes with shared characteristics and first test which group the string is in. If string =~ /^a/ is nil, all other regexes that test for a string starting with a don't need to be tested anymore.
Upvotes: -1
Reputation: 132902
Just do as you suggest, loop over the hash of regex/numbers and return the first that matches a string:
def find_match(regex_mapping, str)
regex_mapping.each do |regex, n|
return n if str =~ regex
end
return nil
end
The only thing to say about efficiency is this: it probably doesn't matter anyway. Just write your code as clearly and simply as you can, and then, in the end, if it is to slow, run it through a profiler (for example the absolutely awesome perftools.rb) and see what the hotspots are. Optimize those. Don't optimize before you've written any code.
That said, a simple optimization you can do in this case, which doesn't cost you anything, is to put the regexes into the mapping hash in an order such that the most likely to match comes first, that way fewer comparisons have to be made (but this is a probabilistic optimization, the worst case running time remains the same). This only applies to Ruby 1.9 though, since hashes don't retain their insertion order in 1.8.
Upvotes: 1
Reputation: 1542
It depends on how complicated your regexes are. If you can put them in capture blocks and have the capture blocks map back to the integers that you need then you should be ok.
For instance:
(is).*(test)
has two capture blocks. This will match:
This is a test
The captures will be 1:is
and 2:test
You can quickly try it on http://www.rubular.com/
Upvotes: 0