Reputation: 22225
I would like to create a class that is instantiated with a String and a Regexp (which describes how to tokenize the string) and provides a method next_token, which returns the next part of the string matching the regex each time it is called, the same way String#scan works. For instance, the code
t = Tokenizer.new('abcdefgh', /.../)
a = t.next_token
b = t.next_token
c = t.next_token
should set a to 'abc', b to 'def', and c to nil. This is an obvious and simple solution:
class Tokenizer
  def initialize(str, reg)
    @tokenized_str = str.scan(reg)
    @next_ind = 0
  end

  def next_token
    @tokenized_str[@next_ind].tap { @next_ind += 1 }
  end
end
This solution requires the whole string to be split into an array in the constructor. I would like to implement a "lazy" approach instead, where the next token is computed only when next_token is called. Can someone suggest how to do it? String#scan must already have such a generator built in, because it can be called with a block, but I don't see how to make use of it in my case.
I wonder whether this is a good use case for a Fiber, because what I'm doing here smells like coroutines, but perhaps there is an easier solution for this kind of problem. Performance also matters, because my application will make heavy use of the Tokenizer class.
Upvotes: 2
Views: 109
Reputation: 110675
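You can use the method String#gsub. When it is called with only a pattern (no replacement and no block), it returns an enumerator that yields each match in turn.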
class Tokenizer
  def initialize(str, reg)
    @token_enum = str.gsub(reg)
  end

  def next_token
    @token_enum.next
  end
end
t = Tokenizer.new('bacdefaghi', /(?<=a)../)
  #=> #<Tokenizer:0x00005af867bfc6f0 @token_enum=
  #     #<Enumerator: "bacdefaghi":gsub(/(?<=a)../)>>
t.next_token #=> "cd"
t.next_token #=> "gh"
t.next_token #=> StopIteration (iteration reached an end)
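If you want to drain the enumerator without handling the exception yourself, note that Kernel#loop rescues StopIteration and simply ends the loop. A minimal sketch, where the variable names tokens and collected are only illustrative:

```ruby
# gsub with a pattern and no block returns an Enumerator over the matches.
tokens = 'bacdefaghi'.gsub(/(?<=a)../)

collected = []
# Kernel#loop stops quietly when Enumerator#next raises StopIteration.
loop do
  collected << tokens.next
end

collected #=> ["cd", "gh"]
```

Each match is still produced only when next is called; the array here is just for demonstration.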
Upvotes: 2
Reputation: 121000
You are nearly there. You need an enumerator instance.
class Tokenizer
  def initialize(str, reg)
    # THIS ⇓⇓⇓⇓⇓⇓⇓⇓
    @tokenized_str = str.enum_for(:scan, reg)
  end

  def next_token
    @tokenized_str.next
  end
end
Beware that Enumerator#next raises StopIteration when there is nothing left to iterate, so you probably want to handle it somehow.
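One way to handle it, assuming you want nil once the tokens run out (as in the question's example), is to rescue the exception inside next_token. This is only a sketch; the class name LazyTokenizer is made up here to keep it apart from the version above:

```ruby
class LazyTokenizer
  def initialize(str, reg)
    # enum_for defers the call to scan until the first #next.
    @tokens = str.enum_for(:scan, reg)
  end

  # Returns the next match, or nil once the enumerator is exhausted.
  def next_token
    @tokens.next
  rescue StopIteration
    nil
  end
end

t = LazyTokenizer.new('abcdefgh', /.../)
a = t.next_token #=> "abc"
b = t.next_token #=> "def"
c = t.next_token #=> nil
```

Note that if the regex contains capture groups, scan yields arrays of the groups rather than the whole match, so the tokens would be arrays in that case.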
Upvotes: 4
Reputation: 23317
You can use StringScanner#scan_until and then remove the part matching the pattern with String#split or String#gsub:
require 'strscan'

ss = StringScanner.new('a-b-c-d-e-f-g')
  #=> #<StringScanner 0/13 @ "a-b-c...">
while s = ss.scan_until(/-/)
  puts s.gsub(/-/, '') # or s.split(/-/)
end
# a
# b
# c
# d
# e
# f
#=> nil
# The trailing "g" is not printed, because no "-" follows it.
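The same scanner can also back the Tokenizer interface from the question if the regex describes the tokens themselves rather than the separators: scan_until advances past the next match and StringScanner#matched returns just the matched text. A rough sketch, with the class name ScannerTokenizer invented for illustration. It assumes the pattern never matches the empty string, and patterns with lookbehind may not behave the same as with String#scan, since the scanner matches from its current position:

```ruby
require 'strscan'

class ScannerTokenizer
  def initialize(str, reg)
    @scanner = StringScanner.new(str)
    @reg = reg
  end

  # Advances to the end of the next match and returns the matched text,
  # or nil when no further match exists (the position is left unchanged).
  def next_token
    @scanner.scan_until(@reg) && @scanner.matched
  end
end

t = ScannerTokenizer.new('abcdefgh', /.../)
a = t.next_token #=> "abc"
b = t.next_token #=> "def"
c = t.next_token #=> nil
```

No exception is involved here, so the end of input is signalled by nil directly.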
Upvotes: 4