Why do these two methods yield different results?

Question

According to all documentation, you can append an element to an array using << or .push or +=, and the result ought to be the same. I have found it isn't. Can anybody explain to me what I am getting wrong? (I am using Ruby 2.3.1.)

I have got a number of hashes. All of them contain the same keys. I would like to combine them to form one hash with all the collected values in an array. This is straightforward, you iterate through all the hashes and make a new one, collecting all the values like this:

    # arg is array of Hashes - keys must be identical
    return {} unless arg
    keys = (arg[0] ? arg[0].keys : [])

    result = keys.product([[]]).to_h # value for each key is empty array.

    arg.each do |h|
      h.each { |k,v| result[k] += [v] }
    end

    result
  end

If instead of using += I use .push or <<, I get completely weird results.

Using the following test array:

a_of_h = [{"1"=>10, "2"=>10, "3"=>10, "4"=>10, "5"=>10, "6"=>10, "7"=>10, "8"=>10, "9"=>10, "10"=>10}, {"1"=>100, "2"=>100, "3"=>100, "4"=>100, "5"=>100, "6"=>100, "7"=>100, "8"=>100, "9"=>100, "10"=>100}, {"1"=>1000, "2"=>1000, "3"=>1000, "4"=>1000, "5"=>1000, "6"=>1000, "7"=>1000, "8"=>1000, "9"=>1000, "10"=>1000}, {"1"=>10000, "2"=>10000, "3"=>10000, "4"=>10000, "5"=>10000, "6"=>10000, "7"=>10000, "8"=>10000, "9"=>10000, "10"=>10000}]

I get

merge_hashes(a_of_h)
 => {"1"=>[10, 100, 1000, 10000], "2"=>[10, 100, 1000, 10000], "3"=>[10, 100, 1000, 10000], "4"=>[10, 100, 1000, 10000], "5"=>[10, 100, 1000, 10000], "6"=>[10, 100, 1000, 10000], "7"=>[10, 100, 1000, 10000], "8"=>[10, 100, 1000, 10000], "9"=>[10, 100, 1000, 10000], "10"=>[10, 100, 1000, 10000]}

as I expect, but if I use h.each { |k,v| result[k] << v } instead I get

buggy_merge_hashes(a_of_h)
 => {"1"=>[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], "2"=>[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], "3"=>[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], "4"=>[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], "5"=>[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], ...}

(I cut the rest.)

What is it I don't know here?

Amadan · Accepted Answer

<< and #push are destructive operations (they change the receiver).

+ (and consequently += as well) is a non-destructive operation (it returns a new object, leaving the receiver unchanged).

While they seem to be doing the same thing, this apparently small difference is crucial.

This comes into play due to another error: all of your subarrays in result start off as the same object. If you add to one of them, you add to all of them.

Why is this not an issue if you use +=? Because result[k] += [v] is the same as result[k] = result[k] += [v] (I'm lying here, there's a subtle difference, but it is not relevant here and just accept that they're the same for now to not get more confused :D ); and as + is non-destructive, result[k] + [v] is a different object than result[k]; when you update the value in the array with this assignment, you are not using the starting [] object any more, and the reference sharing error can't bite you any more.

A better way to create your result array would be one of these:

result = Array.new(keys.size) { [] }
result = keys.map { [] }

which will create a new array object for each element.

However, I would write it all quite differently:

a_of_h.each_with_object(Hash.new { |h, k| h[k] = [] }) { |h, r|
  h.each { |k, v| r[k] << v }
}

each_with_hash will give the passed object to the block as an additional argument (here r, for result), and will return it when the method is done. The argument — the object that will be in r — will be a hash with a default_proc: every time we try to get a key that's not inside yet, it will insert a new array there (i.e. instead of trying to pre-populate our result object, do it on-demand). Then we just go through each of the hashes in your array, and insert the value into the result hash without worrying if the key is there or not.

Why do these two methods yield different results?

Answers (2)

Related Questions