Crashalot

Reputation: 34523

Unexpectedly high memory usage in Ruby: 500B normal for empty hash?

Our program creates a master hash where each key is a symbol representing an ID (about 10-20 characters). Each value is an empty hash.

The master hash has about 800K entries.

Yet we're seeing Ruby's memory usage hit almost 400MB.

That works out to roughly 400MB / 800K ≈ 500 bytes per key/value pair (symbol + empty hash).

Is this normal for Ruby?

Code below:

    # Build the master hash: one symbol key per application ID,
    # each mapped to an empty hash
    def load_app_ids
      cols   = get_columns AppFile
      id_col = cols[:application_id]

      each_record AppFile do |r|
        @apps[r[id_col].intern] = {}
      end
    end

    # Takes a line, strips the record separator, and returns
    # an array of fields
    def split_line(line)
      line.gsub(RecordSeperator, "").split(FieldSeperator)
    end

    # Run a block on each record in a file, up to 
    # @limit records
    def each_record(filename, &block)
      i = 0

      path = File.join(@dir, filename)
      # Block form ensures the file handle is closed when iteration ends
      File.open(path, "r") do |file|
        file.each_line(RecordSeperator) do |line|
          # Get the line split into columns unless it is
          # a comment
          block.call split_line(line) unless line =~ /^#/

          # This import can take a loooong time.
          print "\r#{i}" if (i += 1) % 1000 == 0
          break if @limit and i >= @limit
        end
      end
      print "\n" if i > 1000
    end

    # Return map of column name symbols to column number
    def get_columns(filename)
      path = File.join(@dir, filename)
      description = split_line(File.open(path, &:readline))

      # Strip the leading comment character
      description[0].gsub!(/^#/, "")

      # Return map of symbol to column number
      # (each_with_index avoids rescanning the array per column and
      # handles duplicate column names correctly)
      Hash[ description.each_with_index.map { |str, i| [ str.intern, i ] } ]
    end
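In case it helps, here's roughly how we'd measure the per-entry cost directly. This is a sketch for MRI only, using the bundled objspace extension; the "app_id_#{i}" keys are just stand-ins for our real IDs, and the numbers vary by Ruby version:

    require "objspace"

    h = {}
    800_000.times { |i| h["app_id_#{i}".to_sym] = {} }

    # Collect temporaries first so they don't inflate the total
    GC.start

    # memsize_of_all sums the reported size of every live object (MRI only);
    # in a fresh process the total is dominated by the hash entries
    puts ObjectSpace.memsize_of_all / 800_000.0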

Upvotes: 1

Views: 389

Answers (1)

Neil Slater

Reputation: 27207

I would say this is normal for Ruby. I don't have exact figures for the space used by each data structure, but in general plain Ruby handles this kind of large structure poorly. Among other things, it has to allow for the keys and values being any kind of object; that is very flexible for high-level coding, but inefficient when you don't need such arbitrary control.
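As a rough illustration (MRI only, and the exact numbers vary by Ruby version and platform), the objspace extension can report what a single object costs, and even an "empty" hash is not free:

    require "objspace"

    # An empty Hash still occupies a full object slot
    # (plus any allocated table) - around 40 bytes on typical MRI builds
    ObjectSpace.memsize_of({})   # => 40, for example

    # Each entry in the outer hash then adds its own key/value slots
    # and bucket overhead on top of that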

If I do this in irb:

    h = {}
    800000.times { |x| h[("test" + x.to_s).to_sym] = {} }

I get a process using about 197 MB.

Your process has claimed more space because it created a large number of short-lived objects during processing - the array of fields that split_line builds for each row, for instance. Ruby will eventually clean those up, but that doesn't happen immediately, and the memory is not returned to the OS immediately either.
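If you want to watch that happen, a crude sketch is to force a major GC and check the heap stats before and after (the GC.stat key name here is for MRI 2.x and differs between Ruby versions):

    GC.stat[:heap_live_slots]   # high right after the import: temporaries still live
    GC.start                    # force a full collection
    GC.stat[:heap_live_slots]   # drops - but the RSS the OS reports barely changes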

Edit: I should add that I have worked with large data structures of various kinds in Ruby. If you need them, the general approach is to find a library coded as a native extension (or via FFI) where the code can take advantage of restricted types in an array, for example. The narray gem is a good example of this for numeric arrays, vectors, matrices etc.
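For instance, a minimal narray sketch (numeric data only, so it doesn't fit symbol keys like yours directly):

    require "narray"   # gem install narray

    # 800K 32-bit integers packed into one contiguous buffer:
    # roughly 3.2 MB of payload, instead of one Ruby object per element
    ids = NArray.int(800_000)
    ids[0] = 42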

Upvotes: 1
