Remy Wang

Reputation: 783

Rails - file.readlines vs file.gets

I want to read the lines of a file from an SFTP server. The file has more than 100,000 lines.

I am currently reading it in two ways:

Net::SSH.start(setting.host, setting.user,
  {
    :key_data => [ key ],
    :keys => [],
    :keys_only => true
  }
) do |ssh|
  ssh.sftp.connect do |sftp|
    sftp.dir.foreach(src_dir) do |entry|
      if entry.name.include? today
        filename = "#{src_dir}/#{entry.name}"
        sftp.file.open(filename, "r") do |f|

          # Way 1: load every line into an array, then iterate
          f.readlines.each do |line|
            parse(line)
          end

          # Way 2 (alternative to Way 1): read one line at a time
          while line = f.gets do
            parse(line)
          end
        end
      end
    end 
  end
end

I want to know which way uses less memory.

Upvotes: 0

Views: 827

Answers (2)

mechnicov

Reputation: 15248

Generally, plain loops such as while are faster than block-based iteration, because each block introduces its own scope.

Arrays also take a lot of memory.

#readlines reads all of the lines in the I/O stream and returns them in an array.

#gets reads the next line from the I/O stream.

I wrote a small benchmark for a file with 1387085 lines.

I also added ::readlines, which reads the entire file specified by name and returns its lines in an array, and ::foreach, which executes the block for every line in the named file.

require 'benchmark/ips'
require 'benchmark/memory'

@path = File.join(__dir__, 'file.txt')

def open_readlines
  File.open(@path, 'r') do |f|
    f.readlines.each do |line|
      line << 'www'
    end
  end
end

def open_gets
  File.open(@path, 'r') do |f|
    while line = f.gets do
      line << 'www'
    end
  end
end

def readlines
  File.readlines(@path).each do |line|
    line << 'www'
  end
end

def foreach
  File.foreach(@path) do |line|
    line << 'www'
  end
end

%i[ips memory].each do |benchmark|
  puts benchmark

  Benchmark.send(benchmark) do |x|
    x.report('::open #readlines') { open_readlines }
    x.report('::open #gets') { open_gets }
    x.report('::readlines') { readlines }
    x.report('::foreach') { foreach }

    x.compare!
  end
end

The results are:

ips
Warming up --------------------------------------
   ::open #readlines     1.000  i/100ms
        ::open #gets     1.000  i/100ms
         ::readlines     1.000  i/100ms
           ::foreach     1.000  i/100ms
Calculating -------------------------------------
   ::open #readlines      0.575  (± 0.0%) i/s -      3.000  in   5.397538s
        ::open #gets      0.746  (± 0.0%) i/s -      4.000  in   5.381583s
         ::readlines      0.570  (± 0.0%) i/s -      3.000  in   5.434956s
           ::foreach      0.826  (± 0.0%) i/s -      5.000  in   6.057936s

Comparison:
           ::foreach:        0.8 i/s
        ::open #gets:        0.7 i/s - 1.11x  slower
   ::open #readlines:        0.6 i/s - 1.44x  slower
         ::readlines:        0.6 i/s - 1.45x  slower

memory
Calculating -------------------------------------
   ::open #readlines   822.274M memsize (     8.424k retained)
                         2.774M objects (     1.000  retained)
                        50.000  strings (     0.000  retained)
        ::open #gets   810.638M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
         ::readlines   822.274M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)
           ::foreach   810.638M memsize (     0.000  retained)
                         2.774M objects (     0.000  retained)
                        50.000  strings (     0.000  retained)

Comparison:
           ::foreach:  810638012 allocated
        ::open #gets:  810638052 allocated - 1.00x more
         ::readlines:  822274324 allocated - 1.01x more
   ::open #readlines:  822274364 allocated - 1.01x more
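Since ::foreach came out ahead in both comparisons, here is a minimal sketch of two ways it can be called. The helper names line_lengths and first_lines and the sample file contents are hypothetical, chosen only for illustration. With a block, each line is yielded in turn; without a block, File.foreach returns an Enumerator, so lines can be pulled lazily and reading can stop early.

```ruby
require 'tempfile'

# Hypothetical helper: process each line as it is read.
# No array holding every line of the file is built by foreach itself.
def line_lengths(path)
  lengths = []
  File.foreach(path) { |line| lengths << line.chomp.length }
  lengths
end

# Hypothetical helper: without a block, File.foreach returns an Enumerator,
# so only as many lines as requested are pulled from the file.
def first_lines(path, count)
  File.foreach(path).lazy.map(&:chomp).first(count)
end

# Sample data standing in for the real file.
sample = Tempfile.new('lines')
sample.write("alpha\nbeta\ngamma\n")
sample.close

p line_lengths(sample.path)
p first_lines(sample.path, 2)
```

This only applies to local files as written. Whether the SFTP file object in the question offers an equivalent is not something this sketch assumes.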

Upvotes: 1

fphilipe

Reputation: 10056

What do the docs say? (Note that File is a subclass of IO. The methods #readlines and #gets are defined on IO.)

IO#readlines:

Reads all of the lines […], and returns them in an array.

IO#gets:

Reads the next “line” from the I/O stream.

Thus, I expect the latter to be better in terms of memory usage as it doesn't load the entire file into memory.
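To make the difference concrete, here is a small sketch using an in-memory StringIO as a stand-in for a real file handle. SAMPLE_TEXT and the helper each_line_with_gets are hypothetical names introduced for this example. #readlines hands back one array containing every line, while #gets returns a single line per call and nil at the end of the stream.

```ruby
require 'stringio'

# Hypothetical sample content standing in for a large file.
SAMPLE_TEXT = "one\ntwo\nthree\n"

# Hypothetical helper: yields one line at a time using gets.
# Only the current line needs to be referenced during each iteration.
def each_line_with_gets(io)
  while (line = io.gets)
    yield line
  end
end

# readlines: all lines are materialized together in a single Array.
all_lines = StringIO.new(SAMPLE_TEXT).readlines
p all_lines.size

# gets: track a running result without keeping earlier lines around.
longest = 0
each_line_with_gets(StringIO.new(SAMPLE_TEXT)) do |line|
  longest = [longest, line.chomp.length].max
end
p longest
```

With #readlines the array keeps every line alive until it goes out of scope; with #gets each line can be garbage collected once the loop moves on, as long as the block does not store it.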

Upvotes: 2
