Reputation: 3070
I'm working on a ruby baser lexer. To improve performance, I joined up all tokens' regexps into one big regexp with match group names. The resulting regexp looks like:
/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[\A\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[\A"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/
I'm matching it to my string producing a MatchData where exactly one token is parsed:
bigregex =~ "\n ... garbage"
puts $~.inspect
Which outputs
#<MatchData
"\n"
__anonymous_-1038694222803470993:"\n"
__anonymous_-1394418499721420065:nil
__anonymous_3077187815313752157:nil
LET:nil
IN:nil
CLASS:nil
DEF:nil
DEFM:nil
MULTICLASS:nil
FUNCNAME:nil
ID:nil
STRING:nil
NUMBER:nil>
So, the regex actually matched the "\n" part. Now, I need to figure the match group where it belongs (it's clearly visible from #inspect output that it's _anonymous-1038694222803470993, but I need to get it programmatically).
I could not find any option other than iterating over #names:
m.names.each do |n|
if m[n]
type = n.to_sym
resolved_type = (n.start_with?('__anonymous_') ? nil : type)
val = m[n]
break
end
end
which verifies that the match group did have a match.
The problem here is that it's slow (I spend about 10% of time in the loop; also 8% grabbing the @input[@pos..-1]
to make sure that \A works as expected to match start of string (I do not discard input, just shift the @pos in it).
You can check the full code at GH repo.
Any ideas on how to make it at least a bit faster? Is there any option to figure the "successful" match group easier?
Upvotes: 0
Views: 377
Reputation: 1655
You can do this using the regexp methods .captures()
and .names()
:
matching_string = "\n ...garbage" # or whatever this really is in your code
@input = matching_string.match bigregex # bigregex = your regex
arr = @input.captures
arr.each_with_index do |value, index|
if not value.nil?
the_name_you_want = @input.names[index]
end
end
Or if you expect multiple successful values, you could do:
success_names_arr = []
success_names_arr.push(@input.names[index]) #within the above loop
Pretty similar to your original idea, but if you're looking for efficiency .captures()
method should help with that.
Upvotes: 1
Reputation: 2852
I may have misunderstood this completely but but I'm assuming that all but one token is not nil
and that's the one your after?
If so then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil
value
([^\n:]+:(?!nil)[^\n\>]+)
This will match the whole token ie NAME:value
.
Upvotes: 1