Optimising ruby regexp -- lots of match groups

Question

I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:

/\A(?<__anonymous_-1038694222803470993>(?-mix:
+))|\A(?<__anonymous_-1394418499721420065>(?-mix://[\A
]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?(?-mix:let\s))|\A(?(?-mix:in\s))|\A(?(?-mix:class\s))|\A(?(?-mix:def\s))|\A(?(?-mix:defm\s))|\A(?(?-mix:multiclass\s))|\A(?(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?(?-mix:"[\A"]*"))|\A(?(?-mix:[0-9]+))/

I match it against my string, producing a MatchData in which exactly one token is parsed:

bigregex =~ "
 ... garbage"
puts $~.inspect

Which outputs

So the regex actually matched the leading newline. Now I need to figure out which match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).

I could not find any option other than iterating over #names:

m.names.each do |n|
  if m[n]
    type = n.to_sym
    resolved_type = (n.start_with?('__anonymous_') ? nil : type)
    val = m[n]
    break
  end
end

which verifies that the match group did have a match.

The problem here is that it's slow: I spend about 10% of the time in the loop, and another 8% grabbing @input[@pos..-1] so that \A matches at the start of the string as expected (I do not discard input, I just shift @pos within it).

You can check the full code in the GitHub repo.

Any ideas on how to make it at least a bit faster? Is there an easier way to find the "successful" match group?

joofsh · Accepted Answer

You can do this using the MatchData methods #captures and #names:

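matching_string = "
 ...garbage"   # or whatever this really is in your code
@input = matching_string.match bigregex   # bigregex = your regex
arr = @input.captures
the_name_you_want = nil   # declared outside the block so it is still visible afterwards

arr.each_with_index do |value, index|
  unless value.nil?
    the_name_you_want = @input.names[index]
    break   # only one group can match here; drop this to collect several
  end
end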

Or if you expect multiple successful values, you could do:

success_names_arr = []
success_names_arr.push(@input.names[index]) #within the above loop
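
For reference, here is a self-contained sketch of both variants. TOKEN_RE and its group names, as well as the helper names first_matched_group and matched_group_names, are made up for illustration and are not taken from the question's code. It relies on #names and #captures listing the named groups in the same order, which holds as long as every group name is unique.

```ruby
# Hypothetical token regexp; the group names are invented for this example.
TOKEN_RE = /\A(?<newline>\n+)|\A(?<ident>[a-zA-Z_][a-zA-Z0-9_]*)|\A(?<number>[0-9]+)/

# Returns [group_name, matched_text] for the first group that took part
# in the match, or nil when the regexp does not match at all.
def first_matched_group(regexp, string)
  m = regexp.match(string)
  return nil unless m

  captures = m.captures                        # build the array once
  index = captures.index { |value| !value.nil? }
  return nil unless index

  [m.names[index], captures[index]]
end

# Returns the names of every group that took part in the match.
def matched_group_names(regexp, string)
  m = regexp.match(string)
  return [] unless m

  m.names.zip(m.captures).reject { |_name, value| value.nil? }.map(&:first)
end
```

Whether this is actually faster than iterating over #names depends on the Ruby version and the number of groups, so it is worth benchmarking against the real lexer.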

Pretty similar to your original idea, but if you're looking for efficiency, the #captures method should help with that.
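
The question also mentions the cost of slicing @input[@pos..-1] before every match. One possible way around that, sketched here under assumptions, is to anchor each alternative with \G instead of \A and pass the current offset to Regexp#match, since \G matches where the search begins. TOKEN_AT_POS, its group names, and the tokenize helper are hypothetical and not part of the original lexer.

```ruby
# Hypothetical token regexp anchored with \G, which matches at the offset
# where the search starts, rather than \A (start of the whole string).
TOKEN_AT_POS = /\G(?<space>\s+)|\G(?<ident>[a-zA-Z_][a-zA-Z0-9_]*)|\G(?<number>[0-9]+)/

# Splits input into [type, text] pairs without ever slicing the input string.
def tokenize(input)
  tokens = []
  pos = 0

  while pos < input.length
    # The second argument tells the regexp engine where to start searching.
    m = TOKEN_AT_POS.match(input, pos)
    raise ArgumentError, "unexpected input at offset #{pos}" unless m

    captures = m.captures
    index = captures.index { |value| !value.nil? }
    tokens << [m.names[index].to_sym, captures[index]]

    # m.end(0) is an offset into the original string, not relative to pos.
    pos = m.end(0)
  end

  tokens
end
```

For example, tokenize("let x 42") walks the string once and advances pos after each token. StringScanner from the standard library is another option that keeps its own position, though that is a different design than the one in the question.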
