Farcaller

Reputation: 3070

Optimising ruby regexp -- lots of match groups

I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:

/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[^\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[^"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[^"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/

I match it against my string, which produces a MatchData in which exactly one token is captured:

bigregex =~ "\n ... garbage"
puts $~.inspect

Which outputs

#<MatchData
 "\n"
 __anonymous_-1038694222803470993:"\n"
 __anonymous_-1394418499721420065:nil
 __anonymous_3077187815313752157:nil
 LET:nil
 IN:nil
 CLASS:nil
 DEF:nil
 DEFM:nil
 MULTICLASS:nil
 FUNCNAME:nil
 ID:nil
 STRING:nil
 NUMBER:nil>

So the regex matched the "\n" part. Now I need to figure out which match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).

I could not find any option other than iterating over #names:

m.names.each do |n|
  if m[n]
    type = n.to_sym
    resolved_type = (n.start_with?('__anonymous_') ? nil : type)
    val = m[n]
    break
  end
end

which verifies that the match group did have a match.

The problem is that it's slow: I spend about 10% of the time in this loop, and another 8% taking @input[@pos..-1] so that \A matches at the current position (I do not discard input, I just advance @pos through it).

You can check the full code at GH repo.

Any ideas on how to make it at least a bit faster? Is there an easier way to find out which match group was the "successful" one?

Upvotes: 0

Views: 377

Answers (2)

joofsh

Reputation: 1655

You can do this using the MatchData methods #captures and #names:

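matching_string = "\n ...garbage"   # or whatever this really is in your code
@input = matching_string.match(bigregex)   # bigregex = your regex; returns a MatchData
arr = @input.captures
the_name_you_want = nil   # declared here so it is still visible after the block

arr.each_with_index do |value, index|
  unless value.nil?
    # names and captures are in the same order, so the index lines up
    the_name_you_want = @input.names[index]
  end
end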

Or if you expect multiple successful values, you could do:

success_names_arr = []
success_names_arr.push(@input.names[index]) #within the above loop

This is pretty similar to your original idea, but if you're looking for efficiency, the #captures method should help, since it fetches all group values in one call instead of looking each one up by name.
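The same lookup can also be written without an explicit index loop, and the substring copy mentioned in the question can be avoided by anchoring with \G and passing a start offset to Regexp#match. The following is only a rough sketch under assumptions: TOKEN_RE is a cut-down, hypothetical stand-in for the big regex (three groups, with a made-up NEWLINE name), and next_token is an illustrative helper, not code from the question's repo. Whether it is actually faster would need profiling.

```ruby
# Hypothetical, cut-down stand-in for the big regex. Each alternative is
# anchored with \G (the position where matching starts) instead of \A,
# so the input does not have to be sliced before matching.
TOKEN_RE = /\G(?<NEWLINE>\n+)|\G(?<ID>[a-zA-Z_][a-zA-Z0-9_]*)|\G(?<NUMBER>[0-9]+)/

# Illustrative helper: returns [group_name, matched_text] for the token
# that starts at offset pos, or nil when no alternative matches there.
def next_token(input, pos)
  m = TOKEN_RE.match(input, pos)
  return nil unless m

  # names and captures are parallel arrays; take the first non-nil capture.
  m.names.zip(m.captures).find { |_name, value| value }
end

input = "foo\n42"
p next_token(input, 0)
p next_token(input, 3)
p next_token(input, 4)
```

The caller would then advance its position by the length of the matched text. The group name comes back as a String, so the __anonymous_ prefix check from the question still applies unchanged.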

Upvotes: 1

garyh

Reputation: 2852

I may have misunderstood this completely, but I'm assuming that all but one of the tokens are nil and the non-nil one is the one you're after?

If so then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil value

([^\n:]+:(?!nil)[^\n\>]+)

Run against the #inspect output, this will match the whole token, i.e. NAME:value.

Upvotes: 1
