Reputation: 135
I'm parsing multiple websites and trying to build a hash that looks something like:
"word" => [[01.html, 2], [02.html, 7], [03.html, 4]]
where word is a given word in the index, the first value in each sublist is the file it was found in, and the second value is the number of occurrences in that given file.
I'm running into an issue where, rather than appending ["02.html", 7] inside the values list, it creates a whole new entry for "word" and puts ["02.html", 7] at the end of the hash. This results in basically giving me individual indexes for all of my websites appended after each other rather than giving me one master index.
Here is my code:
for token in tokens
if !invindex.include?(token)
invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
else
for list in invindex[token]
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
end
end
end
end
end
Hopefully it's something simple and I just missed something when I traced it on paper.
Upvotes: 0
Views: 82
Reputation: 48599
I'm running into an issue where, rather than appending ["02.html", 7] inside the values list, it creates a whole new entry for "word" and puts ["02.html", 7] at the end of the hash.
I'm not seeing that:
invindex = {
word1: [
['01.html', 2],
]
}
tokens = %i[
word1
word2
word3
]
doc_name = '02.html'
tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end
end
p invindex
--output:--
{:word1=>[["01.html", 2]], :word2=>[["02.html", 1]], :word3=>[["02.html", 1]]}
invindex[token].insert([doc_name, 1]) #this SHOULD append the doc name
Nope:
invindex = {
word: [
['01.html', 2],
]
}
token = :word
doc_name = '02.html'
invindex[token].insert([doc_name, 7])
p invindex
invindex[token].insert(-1, ["02.html", 7])
p invindex
--output:--
{:word=>[["01.html", 2]]}
{:word=>[["01.html", 2], ["02.html", 7]]}
Array#insert() requires that you specify an index as the first argument. Typically when you want to append something to the end, you use <<:
invindex = {
word: [
['01.html', 2],
]
}
token = :word
doc_name = '02.html'
invindex[token] << [doc_name, 7]
p invindex
--output:--
{:word=>[["01.html", 2], ["02.html", 7]]}
for token in tokens
Rubyists don't use for-in loops because for-in loops call each(), so rubyists call each() directly:
tokens.each do |token|
...
end
Finally, indenting in ruby is 2 spaces--not 3 spaces, not 1 space, not 4 spaces. It's 2 spaces.
Applying all that to your code:
invindex = {
word1: [
['01.html', 2],
]
}
tokens = %i[
word1
word2
word3
]
doc_name = '01.html'
tokens.each do |token|
  if !invindex.include?(token)
    invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
  else
    invindex[token].each do |list|
      if list[0] == doc_name
        list[1] += 1 #adds one to the occurrence with the same doc_name
      else
        invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
      end
    end
  end
end
p invindex
--output:--
{:word1=>[["01.html", 3]], :word2=>[["01.html", 1]], :word3=>[["01.html", 1]]}
However, there is still a problem, which is due to the fact that you are changing an Array that you are stepping through--a big no-no in computer programming:
invindex[token].each do |list|
if list[0] == doc_name
list[1] += 1 #adds one to the occurrence with the same doc_name
else
invindex[token] << [doc_name, 1] #***PROBLEM***
Look what happens:
invindex = {
word1: [
['01.html', 2],
]
}
tokens = %i[
word1
word2
word3
]
%w[ 01.html 02.html].each do |doc_name|
  tokens.each do |token|
    if !invindex.include?(token)
      invindex[token] = [[doc_name, 1]] #adds the word to the hash with the doc name and occurrence of 1
    else
      invindex[token].each do |list|
        if list[0] == doc_name
          list[1] += 1 #adds one to the occurrence with the same doc_name
        else
          invindex[token] << [doc_name, 1] #this SHOULD append the doc name and initial occurrence inside the word's value list since the word is already in the hash
        end
      end
    end
  end
end
p invindex
--output:--
{:word1=>[["01.html", 3], ["02.html", 2]], :word2=>[["01.html", 1], ["02.html", 2]], :word3=>[["01.html", 1], ["02.html", 2]]}
Problem 1: You don't want to insert [doc_name, 1] every time the sub Array you are examining doesn't contain the doc_name--you only want to insert [doc_name, 1] after ALL the sub Arrays have been examined, and the doc_name wasn't found. If you run the example above with the starting hash:
invindex = {
word1: [
['01.html', 2],
['02.html', 7],
]
}
...you will see that the output is even worse.
Problem 2: Appending [doc_name, 1] to the Array while you are stepping through the Array means that [doc_name, 1] will be examined, too, when the loop gets to the end of the Array--and then your loop will increment its count to 2. The rule is: don't change an Array you are stepping through because bad things will happen.
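One way to avoid both problems, as a minimal sketch: search the word's list for doc_name first (here with Array#find), and only once that search has finished either increment the count or append a new pair. The docs hash and its contents are made up for illustration; the other names follow the code above.

```ruby
# Hypothetical input: document name => tokens found in that document.
docs = {
  '01.html' => %w[word1 word2 word1],
  '02.html' => %w[word1 word3],
}

invindex = {}

docs.each do |doc_name, tokens|
  tokens.each do |token|
    # Fetch the word's list of [doc_name, count] pairs, creating it if needed.
    entries = (invindex[token] ||= [])

    # Search first; the list is not modified while it is being searched.
    entry = entries.find { |name, _count| name == doc_name }

    if entry
      entry[1] += 1             # doc already listed: bump its count
    else
      entries << [doc_name, 1]  # doc not listed anywhere: append exactly once
    end
  end
end

p invindex
```

Because the append happens after find has returned, the new pair is never visited by the same pass, so its count stays at 1.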
Upvotes: 1
Reputation: 110675
Suppose:
arr = %w| 01.html 02.html 03.html 02.html 03.html 03.html |
#=> ["01.html", "02.html", "03.html", "02.html", "03.html", "03.html"]
is an array of your files for a given word in the index. Then the value of that word in the hash is given by constructing the counting hash:
h = arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }
#=> {"01.html"=>1, "02.html"=>2, "03.html"=>3}
and then converting it to an array:
h.to_a
#=> [["01.html", 1], ["02.html", 2], ["03.html", 3]]
so you could write:
arr.each_with_object(Hash.new(0)) { |s,h| h[s] += 1 }.to_a
Hash::new is given a default value of zero. That means that if the hash being constructed, h, does not have a key s, h[s] returns zero. In that case:
h[s] += 1
#=> h[s] = h[s] + 1
# = 0 + 1 = 1
and when the same value of s in arr is passed to the block again:
h[s] += 1
#=> h[s] = h[s] + 1
# = 1 + 1 = 2
You may consider whether it would be better to make the value of each word in the index the hash h.
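As a side note, if you are on Ruby 2.7 or later (an assumption about your environment), Enumerable#tally builds the same counting hash in one call. A short sketch using the same arr:

```ruby
arr = %w| 01.html 02.html 03.html 02.html 03.html 03.html |

counts = arr.tally    # counting hash: element => number of occurrences
pairs  = counts.to_a  # the array-of-pairs form used in the question
```

On older Rubies the each_with_object version above remains the way to go.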
Upvotes: 1
Reputation: 8295
Do you actually need to have a hash that contains an array of arrays?
This can be much better described with a nested hash:
invindex = {
"word" => { '01.html' => 2, '02.html' => 7, '03.html' => 4 },
"other" => { '01.html' => 1, '02.html' => 17, '04.html' => 4 }
}
which can be easily populated by using a Hash factory like
invindex = Hash.new { |h,k| h[k] = Hash.new {|hh,kk| hh[kk] = 0} }
tokens.each do |token|
  invindex[token][doc_name] += 1
end
Now, if you absolutely need to have the format you mention, you can get it from the described invindex with a simple iteration:
result = {}
invindex.each {|k,v| result[k] = v.to_a }
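Putting the pieces together over several documents might look like the sketch below. The docs hash is invented for illustration, the inner hash uses Hash.new(0) as a shorter stand-in for the block form (it behaves the same for integer counts), and transform_values assumes Ruby 2.4 or later.

```ruby
# Hypothetical input: document name => tokens found in that document.
docs = {
  '01.html' => %w[word other word],
  '02.html' => %w[word],
}

# The outer default creates an inner hash per word; the inner hash counts from 0.
invindex = Hash.new { |h, k| h[k] = Hash.new(0) }

docs.each do |doc_name, tokens|
  tokens.each { |token| invindex[token][doc_name] += 1 }
end

# Convert each inner hash to the array-of-pairs layout from the question.
result = invindex.transform_values(&:to_a)
```

With this layout there is no list to search and nothing is appended during iteration, so the problem from the question does not arise.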
Upvotes: 1