user789

Reputation: 21

How to batch enumerables in Ruby

In my quest to understand Ruby's Enumerable, I have something similar to the following:

FileReader.read(very_big_file)
          .lazy
          .flat_map { |line| get_array_of_similar_words(line) } # array.size is ~10
          .each_slice(100) # wait for 100 items
          .map { |array| process_100_items(array) }

Although each flat_map call emits an array of only ~10 items, I expected each_slice to collect the flattened items into batches of 100, i.e. to wait until 100 items have accumulated before passing them to the final .map call. That is not what I see.

How do I achieve functionality similar to the buffer function in reactive programming?

Upvotes: 0

Views: 2314

Answers (1)

Cary Swoveland

Reputation: 110675

To see how lazy affects the calculations, let's look at an example. First construct a file:

str = <<~_
Now is the
time for all
good Ruby coders
to come to
the aid of
their bowling
team
_

fname = 't' 
File.write(fname, str)
  #=> 82

and specify the slice size:

slice_size = 4

Now I will read the lines one by one, split each line into words, remove duplicate words within the line, and append those words to an array. As soon as the array contains at least 4 words, I will take the first four and map them to the longest of the 4. The code to do that follows. To show how the calculations progress, I have salted the code with puts statements. Note that IO::foreach without a block returns an enumerator.

IO.foreach(fname).
   lazy.
   tap { |o| puts "o1 = #{o}" }.
   flat_map { |line|
     puts "line = #{line}"
     puts "line.split.uniq = #{line.split.uniq} "
     line.split.uniq }.
   tap { |o| puts "o2 = #{o}" }.
   each_slice(slice_size).
   tap { |o| puts "o3 = #{o}" }.
   map { |arr|
     puts "arr = #{arr}, arr.max = #{arr.max_by(&:size)}"
     arr.max_by(&:size) }.
   tap { |o| puts "o3 = #{o}" }.
   to_a
  #=> ["time", "good", "coders", "bowling", "team"] 

The following is displayed:

o1 = #<Enumerator::Lazy:0x00005992b1ab6970>
o2 = #<Enumerator::Lazy:0x00005992b1ab6880>
o3 = #<Enumerator::Lazy:0x00005992b1ab6678>
o3 = #<Enumerator::Lazy:0x00005992b1ab6420>
line = Now is the
line.split.uniq = ["Now", "is", "the"] 
line = time for all
line.split.uniq = ["time", "for", "all"] 
arr = ["Now", "is", "the", "time"], arr.max = time
line = good Ruby coders
line.split.uniq = ["good", "Ruby", "coders"] 
arr = ["for", "all", "good", "Ruby"], arr.max = good
line = to come to
line.split.uniq = ["to", "come"] 
line = the aid of
line.split.uniq = ["the", "aid", "of"] 
arr = ["coders", "to", "come", "the"], arr.max = coders
line = their bowling
line.split.uniq = ["their", "bowling"] 
arr = ["aid", "of", "their", "bowling"], arr.max = bowling
line = team
line.split.uniq = ["team"] 
arr = ["team"], arr.max = team
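
The trace above suggests that lines are pulled from the source only when a slice needs more words. A minimal sketch of the same pipeline makes that measurable. It assumes an in-memory array in place of the file, and the counter lines_read is a name introduced here purely for illustration:

```ruby
# Same words as the example file, held in memory (an assumption made
# so the sketch runs without touching the file system).
lines = [
  "Now is the\n", "time for all\n", "good Ruby coders\n",
  "to come to\n", "the aid of\n", "their bowling\n", "team\n"
]

lines_read = 0 # illustrative counter, not part of the original answer

longest = lines.each.lazy
  .flat_map do |line|
    lines_read += 1      # count every line the pipeline actually pulls
    line.split.uniq
  end
  .each_slice(4)
  .map { |arr| arr.max_by(&:size) }
  .first(2)              # ask for only the first two slices

longest     #=> ["time", "good"]
lines_read  #=> 3
```

Only three of the seven lines are consumed, because the second slice is complete as soon as "Ruby" arrives from the third line.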

If the line lazy. is removed, every step runs to completion before the next one starts. The return value is the same (the trailing .to_a is now superfluous), but the following is displayed:

o1 = #<Enumerator:0x00005992b1a438f8>
line = Now is the
line.split.uniq = ["Now", "is", "the"] 
line = time for all
line.split.uniq = ["time", "for", "all"] 
line = good Ruby coders
line.split.uniq = ["good", "Ruby", "coders"] 
line = to come to
line.split.uniq = ["to", "come"] 
line = the aid of
line.split.uniq = ["the", "aid", "of"] 
line = their bowling
line.split.uniq = ["their", "bowling"] 
line = team
line.split.uniq = ["team"] 
o2 = ["Now", "is", "the", "time", "for", "all", "good", "Ruby",
      "coders", "to", "come", "the", "aid", "of", "their",
      "bowling", "team"]
o3 = #<Enumerator:0x00005992b1a41a08>
arr = ["Now", "is", "the", "time"], arr.max = time
arr = ["for", "all", "good", "Ruby"], arr.max = good
arr = ["coders", "to", "come", "the"], arr.max = coders
arr = ["aid", "of", "their", "bowling"], arr.max = bowling
arr = ["team"], arr.max = team
o3 = ["time", "good", "coders", "bowling", "team"]
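
Applied to the question, one possible explanation (an assumption, since the asker's full code is not shown) is that the chain ends in .map and is never forced. A lazy chain does no work until a method such as to_a, force, first or each with a block consumes it. The sketch below uses a hypothetical stand-in, expand, for get_array_of_similar_words, and a range of 25 numbers in place of the file's lines:

```ruby
calls = 0

# Hypothetical stand-in for get_array_of_similar_words:
# every input expands to exactly 10 items.
expand = lambda do |n|
  calls += 1
  Array.new(10) { |i| "word_#{n}_#{i}" }
end

pipeline = (1..25).lazy
  .flat_map { |n| expand.call(n) }
  .each_slice(100)             # batches span flat_map boundaries
  .map { |batch| batch.size }  # stand-in for process_100_items

calls_before_force = calls     # still 0: nothing has been evaluated yet

batch_sizes = pipeline.to_a    # forcing the chain runs every step
batch_sizes                    #=> [100, 100, 50]
```

The 250 flattened items arrive in batches of 100, 100 and 50, so each_slice does buffer across the ~10-item arrays once something consumes the chain.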

Upvotes: 3
