user1934428

Reputation: 22225

Ruby: Building a token generator on top of a string

I would like to create a class that is instantiated with a String and a Regexp (which describes how to tokenize the string) and provides a method next_token, which returns the next part of the string matching the regex, in the same way String#scan finds matches. For instance,

t = Tokenizer.new('abcdefgh', /.../)
a = t.next_token
b = t.next_token
c = t.next_token

should set a to 'abc', b to 'def', and c to nil. This is an obvious and simple solution:

class Tokenizer
  def initialize(str, reg)
    @tokenized_str = str.scan(reg)
    @next_ind = 0
  end

  def next_token
    @tokenized_str[@next_ind].tap { @next_ind += 1 }
  end
end

This solution requires that the whole string is split apart into an array in the constructor. I would like to implement a "lazy" approach, where the next token is calculated only when the call to next_token is issued. Can someone suggest how to do it? Actually, String#scan must have such a generator already built in, because we can call it with a block, but I don't see how to make use of it in my case.

I wonder whether this is a good way to use a Fiber, because what I'm doing here smells like co-routines, but perhaps there is an easier solution for this kind of problem. Performance will also be an issue, because my application will make heavy use of the Tokenizer class.

Upvotes: 2

Views: 109

Answers (3)

Cary Swoveland

Reputation: 110675

You can use the method String#gsub.

class Tokenizer
  def initialize(str, reg)
    @token_enum = str.gsub(reg)
  end

  def next_token
    @token_enum.next
  end
end

t = Tokenizer.new('bacdefaghi', /(?<=a)../)
  #=> #<Tokenizer:0x00005af867bfc6f0 @token_enum=
  #     #<Enumerator: "bacdefaghi":gsub(/(?<=a)../)>>

t.next_token  #=> "cd" 
t.next_token  #=> "gh" 
t.next_token  #=> StopIteration (iteration reached an end)
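
If the exception at the end is inconvenient, one option is to pull tokens inside Kernel#loop, which rescues StopIteration and ends the loop quietly. A minimal sketch (the variable names enum and tokens are only for illustration):

```ruby
# String#gsub without a block or replacement returns an Enumerator over the matches.
enum = 'bacdefaghi'.gsub(/(?<=a)../)

tokens = []
# Kernel#loop rescues StopIteration, so the loop simply ends
# when the enumerator has no more matches.
loop do
  tokens << enum.next
end

p tokens
```

This still fetches one match per call to next; the loop only hides the end-of-iteration handling.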

Upvotes: 2

Aleksei Matiushkin

Reputation: 121000

You are nearly there. You need an enumerator instance.

class Tokenizer
  def initialize(str, reg)
    #             THIS   ⇓⇓⇓⇓⇓⇓⇓⇓
    @tokenized_str = str.enum_for(:scan, reg)
  end
  def next_token
    @tokenized_str.next
  end
end

Beware that Enumerator#next raises StopIteration if there is nothing left to iterate, so you probably want to handle it somehow.
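
One way to handle it, assuming you want nil once the tokens run out (as in the question), is to rescue StopIteration inside next_token. This is only a sketch; the instance variable name @token_enum is my own choice:

```ruby
class Tokenizer
  def initialize(str, reg)
    # enum_for builds a lazy Enumerator around String#scan;
    # nothing is matched until next is called.
    @token_enum = str.enum_for(:scan, reg)
  end

  # Returns the next match, or nil when there are no more matches.
  def next_token
    @token_enum.next
  rescue StopIteration
    nil
  end
end

t = Tokenizer.new('abcdefgh', /.../)
a = t.next_token
b = t.next_token
c = t.next_token
p [a, b, c]
```

Note that if the regex contains capture groups, String#scan yields arrays of the groups instead of the whole match, so next_token would return arrays in that case.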

Upvotes: 4

mrzasa

Reputation: 23317

You can use StringScanner#scan_until and then remove the part matching the pattern with String#split or String#gsub:

require 'strscan'

ss = StringScanner.new('a-b-c-d-e-f-g')
#=> #<StringScanner 0/13 @ "a-b-c...">
# The trailing 'g' is never returned, because no '-' follows it.
while s = ss.scan_until(/-/)
  puts s.gsub(/-/, '') # or s.split(/-/)
end
#a
#b
#c
#d
#e
#f
#=> nil
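
If you want the interface from the question, the scanner can be wrapped in a class. The following is a sketch, assuming that the regex describes the tokens themselves (not the separators) and that any text between matches should be skipped. It uses StringScanner#matched to get only the matched part; the names @scanner and @reg are illustrative:

```ruby
require 'strscan'

class Tokenizer
  def initialize(str, reg)
    @scanner = StringScanner.new(str)
    @reg = reg
  end

  # Advances the scanner to the end of the next match and returns the
  # matched text; returns nil when no further match exists.
  def next_token
    @scanner.scan_until(@reg) && @scanner.matched
  end
end

t = Tokenizer.new('abcdefgh', /.../)
a = t.next_token
b = t.next_token
c = t.next_token
p [a, b, c]
```

Here no exception needs to be handled, because scan_until itself returns nil when nothing matches. A regex that can match the empty string would need extra care, since the scanner would not advance.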

Upvotes: 4
