Reputation: 13
I'm trying to write a program that alphabetizes and displays the frequency of words in a given text. I also had to identify stop words in the text and remove them (hence the beginning part). This program runs, but it displays the frequency of each word per line instead of across the entire text, so I get duplicate words. I'm not sure what I'm doing wrong.
l[a] = currentStr.split
words = ""
words = l[a]
stop_words= %w{a and any be by for in it of that the their they then this to we will which}
unique = words - stop_words
unique = l[a]
frequency = Hash.new(0)
unique.each { |unique| frequency[unique] +=1 }
frequency = frequency.sort_by {|x,y| x }
frequency.each { |unique, frequency| puts unique + ' ' + frequency.to_s }
Upvotes: 1
Views: 298
Reputation: 110685
The data
If you are reading the text from a file named "my_new_text", you can "gulp" the whole file as a string, referenced by the variable text, like this:
text = File.read("my_new_text")
If you are not reading from a file, another way is to use a "here document", like this:
text =<<THE_END
It was the best
of times, it was
the worst of times
THE_END
#=> "It was the best\nof times, it was\nthe worst of times\n"
(with THE_END starting at the beginning of the line).
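Reading the whole text into one string matters for the symptom described in the question. The snippet in the question does not show the surrounding loop, so the following is only a sketch under the assumption that the counting was happening once per line. The temporary file and its contents are made up for illustration.

```ruby
require "tempfile"

# Hypothetical sample file, created only so the sketch is runnable.
file = Tempfile.new("my_new_text")
file.write("the best of times\nthe worst of times\n")
file.close

# Per-line counting: a fresh hash for every line, so "times" is
# reported once for each line it appears on.
per_line = File.foreach(file.path).map do |line|
  counts = Hash.new(0)
  line.split.each { |word| counts[word] += 1 }
  counts
end

# Whole-text counting: read everything first, then count once.
text = File.read(file.path)
whole = Hash.new(0)
text.split.each { |word| whole[word] += 1 }

p per_line.map { |counts| counts["times"] }  #=> [1, 1]
p whole["times"]                             #=> 2

file.unlink
```

With a single hash built from the whole string, each word appears exactly once in the result.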
Walking through your code
Let's start by making
STOP_WORDS = %w{a and any be by for in it of that the their they then }
a constant. (I dropped off a few to make it fit on one line.)
I was pleased to see that you created the array of stop words with %w. That saves time, reduces errors and is more readable than having quotes around every word.
Next you have
word_arr = text.split
For the text in the here doc above,
text.split
#=> ["It", "was", "the", "best", "of", "times,",
# "it", "was", "the", "worst", "of", "times"]
Notice that split (much like text.split(/\s+/), except that the no-argument form also ignores leading whitespace) splits the string on whitespace, not just spaces:
"lots of whitespace\n\n\n\nhere".split
#=> ["lots", "of", "whitespace", "here"]
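The two forms of split differ in one small way that is easy to trip over: how they treat whitespace at the start of the string. A minimal sketch, using a made-up string:

```ruby
# A made-up string with leading spaces, a tab and blank lines.
padded = "  lots of\twhitespace\n\nhere"

# No argument: split on runs of whitespace, ignoring leading whitespace.
default_split = padded.split

# Explicit regex: the leading whitespace produces an empty first element.
regex_split = padded.split(/\s+/)

p default_split  #=> ["lots", "of", "whitespace", "here"]
p regex_split    #=> ["", "lots", "of", "whitespace", "here"]
```

For word counting, the no-argument form is usually the safer choice, since it never produces an empty "word".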
Before we split, we should first convert all the characters in text to lower-case:
text.downcase
There are two reasons to do this. First, as @Steve mentioned in a comment, we want words like "we" and "We" to be treated as identical for the purposes of determining frequency. Second, we want to remove stop words that are capitalized.
Now we can split the string and put the individual words in an array:
word_arr = text.downcase.split
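To see both effects of downcase at once, here is a small sketch. The sentence and the shortened stop-word list are invented for illustration.

```ruby
# Hypothetical shortened stop-word list and sample sentence.
STOP_WORDS = %w{a and the we}
text = "We saw the cat and the cat saw us. We left."

# Without downcase, the capitalized stop word "We" survives.
without_downcase = text.split - STOP_WORDS

# With downcase, "We" becomes "we" and is removed like any other stop word.
with_downcase = text.downcase.split - STOP_WORDS

p without_downcase  #=> ["We", "saw", "cat", "cat", "saw", "us.", "We", "left."]
p with_downcase     #=> ["saw", "cat", "cat", "saw", "us.", "left."]
```

Note that split leaves punctuation attached ("us.", "left."), so words next to punctuation are counted separately from their bare forms unless the punctuation is stripped first.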
Your line
words = ""
does nothing, because it is followed by
words = word_arr
which overwrites "".
But why create words when word_arr is perfectly fine? So forget words.
Your way of getting rid of the stop words is also very nice:
unique = word_arr - STOP_WORDS
But you completely undo that with
unique = word_arr
So get rid of that last statement. Also, unique is not a very good name here because many of the words that are left are probably not unique. Maybe something like nonstop_words. Hmmm. Maybe not. I'll leave that to you.
This is also very nice:
frequency = Hash.new(0)
unique.each { |word| frequency[word] +=1 }
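The reason this works is the default value passed to Hash.new. A minimal sketch with a made-up word list:

```ruby
words = %w{best times was worst times was}

# Hash.new(0) returns 0 for any missing key, so += 1 works the first
# time a word is seen, with no explicit "does the key exist?" check.
frequency = Hash.new(0)
words.each { |word| frequency[word] += 1 }

p frequency.to_a  #=> [["best", 1], ["times", 2], ["was", 2], ["worst", 1]]

# On Ruby 2.7 or later, Enumerable#tally builds the same counts.
p words.tally == frequency if words.respond_to?(:tally)
```

With a plain {} instead of Hash.new(0), the first += 1 for each word would raise NoMethodError, because the missing key returns nil.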
But not this:
new_frequency = frequency.sort_by {|k,v| k }
(but you have the right idea with sort_by) because that sorts on the keys, which are words. If you just wanted to sort on frequency, that would be:
new_frequency = frequency.sort_by {|k,v| v }
That gives you the least frequently-occurring words first. If you want the words that appear most frequently first (as I expect you do), you could write
new_frequency = frequency.sort_by {|k,v| v }.reverse
or
new_frequency = frequency.sort_by {|k,v| -v }
(Notice I'm saving to a new object, new_frequency; that makes debugging a lot easier.)
We still haven't dealt with the problem of words that have the same frequency. You want those sorted alphabetically. That's not a problem because Ruby sorts arrays "lexicographically". When sorting an array, Ruby compares each pair of elements with the method Array#<=>. Please read that doc for an explanation.
The upshot is that we can sort the way you want like this:
new_frequency = frequency.sort_by {|k,v| [-v, k] }
(This assumes you want words appearing most frequently first.) When ordering two words, Ruby first gives preference to the smaller value of -v (which is the bigger value of v); if that's the same for both words, it goes to k to break the tie.
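The element-by-element comparison can be checked directly. This sketch uses a small hand-built hash rather than real text:

```ruby
# Array#<=> compares element by element and stops at the first difference.
p([-2, "times"] <=> [-1, "best"])  #=> -1 (higher count sorts first)
p([-2, "was"] <=> [-2, "times"])   #=> 1  (tie on count, "times" before "was")

# A hand-built frequency hash, for illustration only.
frequency = { "worst" => 1, "was" => 2, "times" => 2, "best" => 1 }

sorted = frequency.sort_by { |word, count| [-count, word] }
p sorted  #=> [["times", 2], ["was", 2], ["best", 1], ["worst", 1]]
```

Note that sort_by on a hash returns an array of [key, value] pairs, not a hash.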
Improving your code
There's one more thing that should be done, and that is to write this in a more Ruby-like way, by "chaining" the various methods we've used above. This is what we have (I've gone back to using words rather than word_arr):
words = text.downcase.split
unique = words-STOP_WORDS
frequency = Hash.new(0)
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
Now watch carefully as I pull the rabbit out of the hat. The above is the same as:
frequency = Hash.new(0)
unique = text.downcase.split-STOP_WORDS
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
frequency = Hash.new(0)
(text.downcase.split-STOP_WORDS).each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }
which is the same as:
new_frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }.sort_by {|k,v| [-v, k] }
which we might wrap in a method:
def word_frequency(text)
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }.sort_by {|k,v| [-v, k] }
end
On the other hand, you might not want to chain everything and may prefer to write some or all blocks with do-end:
def word_frequency(text)
  words = text.downcase.split-STOP_WORDS
  words.each_with_object(Hash.new(0)) do |word,h|
    h[word] +=1
  end.sort_by { |k,v| [-v, k] }
end
That's entirely up to you.
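For completeness, here is a sketch of the method run end to end on the here document from the start of this answer, printing each word with its count in the same "word count" format as the question. The second method, word_frequency_no_punct, is not part of the approach above: it is a hypothetical variant that assumes you want punctuation ignored, using String#scan with a simple letters-and-apostrophes pattern.

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then}

def word_frequency(text)
  words = text.downcase.split - STOP_WORDS
  words.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
       .sort_by { |word, count| [-count, word] }
end

# Hypothetical variant: extract runs of letters and apostrophes so that
# "times," and "times" are counted as the same word.
def word_frequency_no_punct(text)
  words = text.downcase.scan(/[a-z']+/) - STOP_WORDS
  words.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
       .sort_by { |word, count| [-count, word] }
end

text = <<THE_END
It was the best
of times, it was
the worst of times
THE_END

p word_frequency(text)
#=> [["was", 2], ["best", 1], ["times", 1], ["times,", 1], ["worst", 1]]

p word_frequency_no_punct(text)
#=> [["times", 2], ["was", 2], ["best", 1], ["worst", 1]]

word_frequency_no_punct(text).each { |word, count| puts "#{word} #{count}" }
```

The first result shows why plain split may not be enough for real text: "times," and "times" end up as separate entries.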
If you have any problem following any of the last bits, not to worry. I just wanted to give you a flavor for the power of the language, to show you what you can look forward to as you gain experience.
Upvotes: 2