jblittle

Reputation: 135

Appending to an array value in a hash

I'm parsing multiple websites and trying to build a hash that looks something like:

"word" => [["01.html", 2], ["02.html", 7], ["03.html", 4]]

where word is a given word in the index, the first value in each sublist is the file it was found in, and the second value is the number of occurrences in that given file.

I'm running into an issue where, rather than appending ["02.html", 7] inside the values list, it creates a whole new entry for "word" and puts ["02.html", 7] at the end of the hash. The result is a separate index for each website, one after another, rather than a single master index.

Here is my code:

for token in tokens
   if !invindex.include?(token)
     invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
   else
     for list in invindex[token]
       if list[0] == doc_name
         list[1] += 1 #adds one to the occurrence with the same doc_name
       else
         invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
       end
     end
   end
 end
end

Hopefully it's something simple and I just missed something when I traced it on paper.

Upvotes: 0

Views: 82

Answers (3)

7stud

Reputation: 48599

I'm running into an issue where, rather than appending ["02.html", 7] inside the values list, it creates a whole new entry for "word" and puts ["02.html", 7] at the end of the hash.

I'm not seeing that:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

doc_name = '02.html'

tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end

end

p invindex

--output:--
{:word1=>[["01.html", 2]], :word2=>[["02.html", 1]], :word3=>[["02.html", 1]]}

invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name

Nope:

invindex = {
  word: [ 
    ['01.html', 2],
  ]
}

token = :word
doc_name = '02.html'

invindex[token].insert([doc_name, 7])
p invindex
invindex[token].insert(-1, ["02.html", 7])
p invindex

--output:--
{:word=>[["01.html", 2]]}
{:word=>[["01.html", 2], ["02.html", 7]]}

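Array#insert() requires that you specify an index as the first argument, followed by the object(s) to insert. The first call above passes no object to insert after the first argument, so the array is left unchanged. Typically when you want to append something to the end, you use <<: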

invindex = {
  word: [ 
    ['01.html', 2],
  ]
}

token = :word
doc_name = '02.html'

invindex[token] << [doc_name, 7]
p invindex

--output:--
{:word=>[["01.html", 2], ["02.html", 7]]}  

for token in tokens

Rubyists don't use for-in loops because for-in loops call each(), so rubyists call each() directly:

tokens.each do |token|
  ...
end

Finally, indenting in Ruby is 2 spaces--not 3 spaces, not 1 space, not 4 spaces. It's 2 spaces.

Applying all that to your code:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

doc_name = '01.html'

tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end

end

p invindex

--output:--
{:word1=>[["01.html", 3]], :word2=>[["01.html", 1]], :word3=>[["01.html", 1]]}

However, there is still a problem, which is due to the fact that you are changing an Array that you are stepping through--a big no-no in computer programming:

    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token] << [doc_name, 1]  #***PROBLEM***

Look what happens:

invindex = {
  word1: [ 
    ['01.html', 2],
  ]
}

tokens = %i[
  word1
  word2
  word3
]

%w[ 01.html 02.html].each do |doc_name|

  tokens.each do |token|
    if !invindex.include?(token)
      invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and  occurrence of 1
    else
      invindex[token].each do |list|
        if list[0] == doc_name
          list[1] += 1 #adds one to the occurrence with the same doc_name
        else
          invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
        end
      end
    end

  end
end

p invindex

--output:--
{:word1=>[["01.html", 3], ["02.html", 2]], :word2=>[["01.html", 1], ["02.html", 2]], :word3=>[["01.html", 1], ["02.html", 2]]}

Problem 1: You don't want to insert [doc_name, 1] every time the sub Array you are examining doesn't contain the doc_name--you only want to insert [doc_name, 1] after ALL the sub Arrays have been examined, and the doc_name wasn't found. If you run the example above with the starting hash:

invindex = {
  word1: [ 
    ['01.html', 2],
    ['02.html', 7],
  ]
}

...you will see that the output is even worse.

Problem 2: Appending [doc_name, 1] to the Array while you are stepping through the Array means that [doc_name, 1] will be examined, too, when the loop gets to the end of the Array--and then your loop will increment its count to 2. The rule is: don't change an Array you are stepping through because bad things will happen.
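
One way to avoid both problems is to finish the search before touching the Array: look for the sub Array for the current doc_name first, and append only if the search came back empty. The following is a minimal sketch under assumptions, not a drop-in fix: the docs hash (document name mapped to its tokens) and the entries/entry variable names are made up for illustration and do not appear in the original code.

```ruby
# Hypothetical input: each document name mapped to the tokens found in it.
docs = {
  '01.html' => %i[word1 word2 word1],
  '02.html' => %i[word1 word3],
}

invindex = {}

docs.each do |doc_name, tokens|
  tokens.each do |token|
    # Fetch the list for this token, creating an empty one the first time.
    entries = (invindex[token] ||= [])

    # Search for this document's sub Array; the Array is not modified here.
    entry = entries.find { |name, _count| name == doc_name }

    if entry
      entry[1] += 1             # document already listed: bump its count
    else
      entries << [doc_name, 1]  # search finished with no match: append once
    end
  end
end

p invindex
```

With this input, word1 ends up with the pairs ["01.html", 2] and ["02.html", 1], and no count is inflated, because the append happens outside any iteration over the Array.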

Upvotes: 1

Cary Swoveland

Reputation: 110675

Suppose:

arr = %w| 01.html 02.html 03.html 02.html 03.html 03.html |
  #=> ["01.html", "02.html", "03.html", "02.html", "03.html", "03.html"] 

is an array of your files for a given word in the index. Then the value of that word in the hash is given by constructing the counting hash:

h = arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }
  #=> {"01.html"=>1, "02.html"=>2, "03.html"=>3}

and then converting it to an array:

h.to_a
  #=> [["01.html", 1], ["02.html", 2], ["03.html", 3]]

so you could write:

arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }.to_a

Hash::new is given a default value of zero. That means that if the hash being constructed, h, does not have a key s, h[s] returns zero. In that case:

h[s] += 1
  #=> h[s] = h[s] + 1
  #        = 0 + 1 = 1

and when the same value of s in arr is passed to the block:

h[s] += 1
  #=> h[s] = h[s] + 1
  #        = 1 + 1 = 2

You may consider whether it would be better to make the value of each word in the index the hash h.
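
To illustrate that last suggestion, here is a rough sketch of an index whose values are the counting hashes themselves. The occurrences hash and the names index, word and files are assumptions made up for this example; they are not from the question.

```ruby
# Hypothetical data: for each word, the file of every occurrence.
occurrences = {
  'word'  => %w[01.html 02.html 03.html 02.html 03.html 03.html],
  'other' => %w[01.html 01.html],
}

# Build one counting hash per word and store it as that word's value.
index = occurrences.each_with_object({}) do |(word, files), idx|
  idx[word] = files.each_with_object(Hash.new(0)) { |f, h| h[f] += 1 }
end

count = index['word']['03.html']  # direct lookup, no scan of sub Arrays
pairs = index['word'].to_a        # array-of-pairs form, only when needed
```

Keeping the hash means a count is looked up by file name directly, and the array-of-arrays form is still one to_a call away. On Ruby 2.7 or later, files.tally should give the same counts, though the resulting hash has no default of zero.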

Upvotes: 1

xlembouras

Reputation: 8295

Do you actually need to have a hash that contains an array of arrays?

This can be much better described with a nested hash:

invindex = {
  "word" => { '01.html' => 2, '02.html' => 7, '03.html' => 4 },
  "other" => { '01.html' => 1, '02.html' => 17, '04.html' => 4 }
}

which can be easily populated by using a Hash factory like

invindex = Hash.new { |h,k| h[k] = Hash.new {|hh,kk| hh[kk] = 0} }
tokens.each do |token|
  invindex[token][doc_name] += 1
end

Now, if you absolutely need the format you mention, you can get it from the invindex described above with a simple iteration:

result = {}
invindex.each {|k,v| result[k] = v.to_a }
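
Putting the pieces together over more than one document might look like the sketch below. The docs hash is a made-up input for illustration, and the inner default is written as Hash.new(0), which behaves the same as the block form for integer counts.

```ruby
# Hypothetical input: document name => tokens found in that document.
docs = {
  '01.html' => %w[word other word],
  '02.html' => %w[word],
}

# Outer default creates and stores a counting hash for each new word.
invindex = Hash.new { |h, k| h[k] = Hash.new(0) }

docs.each do |doc_name, tokens|
  tokens.each { |token| invindex[token][doc_name] += 1 }
end

# Convert to the array-of-arrays format from the question.
result = invindex.each_with_object({}) { |(k, v), r| r[k] = v.to_a }

# Caveat: merely reading a missing word stores an empty hash for it,
# so test membership with key? rather than with [].
known_before = invindex.key?('missing')
invindex['missing']
known_after = invindex.key?('missing')
```

One design point to keep in mind: because the outer default block assigns h[k], a plain read of an unknown word adds that word to the index. Building result before any such lookups, or checking with key?, avoids stray empty entries.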

Upvotes: 1
