Ruby: Frequency and Alphabetizing

Question

I'm trying to write a program that alphabetizes and displays the frequency of words in a given text. I also had to identify stop words in the text and remove them (hence the beginning part). This program runs, but it displays the frequency of each word per line instead of for the entire text, so I get duplicate words. I'm not sure what I'm doing wrong.

l[a] = currentStr.split
words = ""
words = l[a]
stop_words= %w{a and any be by for in it of that the their they then this to we will which} 
unique = words - stop_words
unique = l[a]

frequency = Hash.new(0) 
unique.each { |unique| frequency[unique] +=1 } 

frequency = frequency.sort_by {|x,y| x } 
frequency.each { |unique, frequency| puts unique + ' ' + frequency.to_s }

Cary Swoveland · Accepted Answer

The data

If you are reading the text from a file named "my_new_text", you can "gulp" the whole file as a string, referenced by the variable text, like this:

text = File.read("my_new_text")

If you are not reading from a file, another way is to use a "here document", like this:

text = <<THE_END
It was the best
of times, it was
the worst of times
THE_END

(with THE_END starting at the beginning of the line).
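The per-line counts in the question suggest the hash is being rebuilt for every line. If you do want to process the text one line at a time, a minimal sketch (assuming the lines come from String#each_line; the sample text here is only an illustration) is to create the hash once, before the loop:

```ruby
# Illustrative sample text; in practice this could come from File.read.
text = "It was the best\nof times it was\nthe worst of times\n"

# Create the hash ONCE, outside the loop, so counts carry over between lines.
frequency = Hash.new(0)

text.each_line do |line|
  # Count the words on this line into the shared hash.
  line.split.each { |word| frequency[word] += 1 }
end

puts frequency["was"]    # 2
puts frequency["times"]  # 2
```

If the hash were created inside the each_line block instead, each line would start from zero and the same word would be reported once per line.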

Walking through your code

Let's start by making

STOP_WORDS = %w{a and any be by for in it of that the their they then }

a constant. (I dropped off a few to make it fit on one line.)

I was pleased to see that you created the array of stop words with %w. That saves time, reduces errors and is more readable than having quotes around every word.

Next you have

word_arr = text.split

For the text in the here doc above,

text.split
  #=> ["It", "was", "the", "best", "of", "times,",
  #    "it", "was", "the", "worst", "of", "times"]

Notice that split with no argument splits the string on whitespace, not just spaces (it behaves like text.split(/\s+/), except that it also ignores leading whitespace):

"lots    of whitespace\n\n\nhere".split
  #=> ["lots", "of", "whitespace", "here"]
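One thing split does not do is strip punctuation, so "times," and "times" would be counted as different words. A possible workaround, which is my assumption and not part of the code you posted, is String#scan with a pattern that treats a word as a run of ASCII letters or apostrophes:

```ruby
text = "It was the best\nof times, it was\nthe worst of times\n"

# split breaks only on whitespace, so the comma stays glued to "times,".
split_words = text.split

# Assumption: a "word" is a run of ASCII letters or apostrophes.
# scan returns every match of the pattern, which drops the comma.
scanned_words = text.scan(/[A-Za-z']+/)

puts split_words.include?("times,")    # true
puts scanned_words.include?("times,")  # false
puts scanned_words.count("times")      # 2
```

Whether that pattern is right depends on your text (it would split hyphenated words and drop digits, for example), so adjust it to suit.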

Before we split, we should first convert all the characters in text to lower-case:

text.downcase

There are two reasons to do this. First, as @Steve mentioned in a comment, we want words like "we" and "We" to be treated as identical for the purposes of determining frequency. Second, the stop words are all lower-case, so a capitalized stop word such as "It" would not be removed unless the text is downcased first.

Now we can split the string and put the individual words in an array:

word_arr = text.downcase.split
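Here is a small sketch of both reasons, using the sample text and the shortened stop word list from above:

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then}
text = "It was the best\nof times, it was\nthe worst of times\n"

mixed_case = text.split
lower_case = text.downcase.split

# Reason 1: "It" and "it" only count as the same word after downcase.
puts mixed_case.count("it")   # 1
puts lower_case.count("it")   # 2

# Reason 2: the capitalized stop word "It" survives unless we downcase first.
survivors = mixed_case - STOP_WORDS
puts survivors.include?("It")                    # true
puts (lower_case - STOP_WORDS).include?("it")    # false
```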

Your line

words = ""

does nothing, because it is followed by

words = word_arr

which overwrites "".

But why create words when word_arr is perfectly fine? So forget words.

Your way of getting rid of the stop words is also very nice:

unique = word_arr - STOP_WORDS

But you completely undo that with

unique = word_arr

So get rid of that last statement. Also, unique is not a very good name here because many of the words that are left are probably not unique. Maybe something like nonstop_words. Hmmm. Maybe not. I'll leave that to you.
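To see why "unique" is a misleading name, here is what Array#- does with a small, hand-written word array (the array is just an illustration):

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then}
word_arr = %w{it was the best of times it was the worst of times}

# Array#- removes every element that also appears in STOP_WORDS,
# but it keeps duplicates of the words that remain.
remaining = word_arr - STOP_WORDS

p remaining
  #=> ["was", "best", "times", "was", "worst", "times"]
```

The stop words are gone, but "was" and "times" each still appear twice, which is exactly what we need for counting.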

This is also very nice:

frequency = Hash.new(0) 
unique.each { |word| frequency[word] +=1 }
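The reason this works is that Hash.new(0) returns 0 for any key it has not seen, so += 1 is safe the first time a word turns up. A short sketch with a hand-written array; the tally comparison is an aside that assumes Ruby 2.7 or later, where Enumerable#tally exists:

```ruby
unique = %w{was best times was worst times}

# Missing keys read as 0, so the first += 1 for a word stores 1.
frequency = Hash.new(0)
unique.each { |word| frequency[word] += 1 }

puts frequency["was"]      # 2
puts frequency["best"]     # 1
puts frequency["missing"]  # 0 (the default; the key is not added by reading)

# On Ruby 2.7+, Enumerable#tally builds the same counts in one call.
puts unique.tally == frequency if unique.respond_to?(:tally)
```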

But not this:

new_frequency = frequency.sort_by {|k,v| k }

(but you have the right idea with sort_by) because that sorts on the keys, which are words. If you just wanted to sort on frequency, that would be:

new_frequency = frequency.sort_by {|k,v| v }

That gives you the least frequently-occurring words first. If you want the words that appear most frequently first (as I expect you do), you could write

new_frequency = frequency.sort_by {|k,v| v }.reverse

or

new_frequency = frequency.sort_by {|k,v| -v }

(Notice I'm saving to a new object, new_frequency, which makes debugging a lot easier.)

We still haven't dealt with the problem of words that have the same frequency. You want those sorted alphabetically. That's not a problem because Ruby sorts arrays "lexicographically". When sorting an array, Ruby compares each pair of elements with the method Array#<=>. Please read that doc for an explanation.

The upshot is that we can sort the way you want like this:

new_frequency = frequency.sort_by {|k,v| [-v, k] }

(This assumes you want words appearing most frequently first.) When ordering two words, Ruby first gives preference to the smaller value of -v (which is the bigger value of v); if that's the same for both words, it goes to k to break the tie.
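To make the tie-breaking concrete, here is a sketch with a small hand-written hash (the counts are illustrative):

```ruby
# Array#<=> compares element by element and stops at the first difference.
p([-2, "times"] <=> [-2, "was"])   #=> -1  (counts tie, and "times" < "was")
p([-2, "was"] <=> [-1, "best"])    #=> -1  (-2 < -1, so the words are never compared)

frequency = { "was" => 2, "best" => 1, "times" => 2, "worst" => 1 }

# Sort by descending count, then alphabetically within equal counts.
new_frequency = frequency.sort_by { |k, v| [-v, k] }

p new_frequency
  #=> [["times", 2], ["was", 2], ["best", 1], ["worst", 1]]
```

Note that sort_by returns an array of [word, count] pairs, not a hash.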

Improving your code

There's one more thing that should be done, and that is to write this in a more Ruby-like way, by "chaining" the various methods we've used above. This is what we have (I've gone back to using words rather than word_arr):

words = text.downcase.split
unique = words-STOP_WORDS
frequency = Hash.new(0) 
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

Now watch carefully as I pull the rabbit out of the hat. The above is the same as:

frequency = Hash.new(0) 
unique = text.downcase.split-STOP_WORDS
unique.each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

which is the same as:

frequency = Hash.new(0) 
(text.downcase.split-STOP_WORDS).each { |word| frequency[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

which is the same as:

frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h| 
    h[word] +=1 }
new_frequency = frequency.sort_by {|k,v| [-v, k] }

which is the same as:

new_frequency =
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h| 
    h[word] +=1 }.sort_by {|k,v| [-v, k] }

which we might wrap in a method:

def word_frequency(text)
  (text.downcase.split-STOP_WORDS).each_with_object(Hash.new(0)) { |word,h|
    h[word] +=1 }.sort_by {|k,v| [-v, k] }
end

On the other hand, you might not want to chain everything and may prefer to write some or all blocks with do-end:

def word_frequency(text)
  words = text.downcase.split-STOP_WORDS
  words.each_with_object(Hash.new(0)) do |word,h| 
    h[word] +=1
  end.sort_by { |k,v| [-v, k] }
end

That's entirely up to you.

If you have any problem following any of the last bits, not to worry. I just wanted to give you a flavor for the power of the language, to show you what you can look forward to as you gain experience.
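Putting it all together, here is a runnable sketch of the method with the shortened stop word list. The sample text is the one from above with the comma left out (an assumption on my part, so that split yields clean words), and the alphabetical re-sort at the end is just one way to get the ordering asked about in the question:

```ruby
STOP_WORDS = %w{a and any be by for in it of that the their they then}

def word_frequency(text)
  words = text.downcase.split - STOP_WORDS
  words.each_with_object(Hash.new(0)) do |word, h|
    h[word] += 1
  end.sort_by { |k, v| [-v, k] }
end

# Sample text without punctuation (illustrative).
text = "It was the best\nof times it was\nthe worst of times\n"

result = word_frequency(text)
result.each { |word, count| puts "#{word} #{count}" }
# times 2
# was 2
# best 1
# worst 1

# If you want a purely alphabetical listing instead, sort on the word alone.
alphabetical = result.sort_by { |word, _count| word }
p alphabetical
  #=> [["best", 1], ["times", 2], ["was", 2], ["worst", 1]]
```

Because the whole text is counted in one pass, each word appears once in the output rather than once per line.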
