maria

Reputation: 13

Searching for single words and combination words in Ruby

I want to count how often the words "candy" and "gram" occur in a given text (whole_file), and also how often the combinations "candy gram" and "gram candy" occur. The code below displays the counts for "candy" and "gram", but when I add the combinations to the %w list, only the counts for the single words "candy" and "gram" are displayed. Should I try a different approach? Thanks so much.

myArray = whole_file.split

stop_words= %w{ candy gram 'candy gram' 'gram candy' } 

nonstop_words = myArray - stop_words

key_words = myArray - nonstop_words

frequency = Hash.new (0)

key_words.each { |word| frequency[word] +=1 }

key_words = frequency.sort_by {|x,y| x }

key_words.each { |word, frequency| puts word + ' ' + frequency.to_s }

Upvotes: 0

Views: 795

Answers (2)

Cary Swoveland

Reputation: 110675

Strip punctuation and convert to lower-case

The first thing you probably want to do is remove all punctuation from the string holding the contents of the file and convert what's left to lower case, the latter so that 'Cat' and 'cat' are counted as the same word. If the punctuation filter keeps only lower-case letters, as it does below, convert to lower case first; otherwise the upper-case letters would be stripped along with the punctuation.

Changing upper-case letters to lower-case is easy:

text = whole_file.downcase

To remove the punctuation it is probably easier to decide what to keep rather than what to discard. If we only want to keep lower-case letters and whitespace (the latter so the text can still be split into words), you can do this:

text = whole_file.downcase.gsub(/[^a-z\s]/, '')

That is, substitute an empty string for all characters other than (^) lowercase letters and whitespace.1
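As a quick illustration, here is a minimal sketch of that normalization. The sample string is made up and stands in for the contents of whole_file, and keeping whitespace is an assumption made so the result can still be split into words.

```ruby
# Hypothetical sample standing in for the contents of whole_file.
sample = "Candy, gram! Candy-gram?\nGRAM candy."

# Lower-case first, then drop everything except a-z and whitespace.
text = sample.downcase.gsub(/[^a-z\s]/, '')

# The cleaned text can now be split on whitespace.
words = text.split

puts text.inspect
puts words.inspect
```

Note that the hyphen in "Candy-gram" is removed too, so the two halves are fused into the single word "candygram".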

Determine frequency of individual words

If you want to count the number of times text contains the word 'candy', you can use the method String#scan on the string text and then determine the size of the array that is returned:

text.scan(/\bcandy\b/).size

scan returns an array containing every match of 'candy'; .size returns the size of that array. Here \b ensures that 'candy' has a word "boundary" at each end, which could be whitespace, punctuation, or the beginning or end of a line or the file. That prevents 'candycane' from being counted.
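To see the effect of \b, the following sketch compares the regex with and without the boundaries on a made-up string:

```ruby
# Made-up sample text containing "candy" on its own and inside "candycane".
text = "candy candycane gram candy"

whole_words = text.scan(/\bcandy\b/)  # only the standalone word
any_match   = text.scan(/candy/)      # also matches inside "candycane"

puts whole_words.size
puts any_match.size
```

The first count is 2 and the second is 3, because the unanchored regex also picks up the "candy" at the start of "candycane".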

A second way is to convert the string text to an array of words, as you have done2:

myArray = text.split

If you don't mind, I'd like to call this:

words = text.split

as I find that more expressive.3

The most direct way to determine the number of times 'candy' appears is to use the method Array#count, like this:

words.count('candy')

You can also use the array difference method, Array#-, as you noted:

words.size - (words - ['candy']).size

If you wish to know the number of times either 'candy' or 'gram' appears, you could of course do the above for each and sum the two counts. Some other ways are:

words.size - (words - ['candy', 'gram']).size
words.count { |word| word == 'candy' || word == 'gram' }
words.count { |word| ['candy', 'gram'].include?(word) }
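A small sketch, using a made-up word list, shows that these approaches agree with each other and with the sum of the two individual counts:

```ruby
# Made-up word list: 'candy' appears twice and 'gram' three times.
words   = %w[candy gram cane candy sugar gram gram]
targets = %w[candy gram]

by_sum        = words.count('candy') + words.count('gram')
by_difference = words.size - (words - targets).size
by_block      = words.count { |word| word == 'candy' || word == 'gram' }
by_include    = words.count { |word| targets.include?(word) }

puts [by_sum, by_difference, by_block, by_include].inspect
```

All four values are 5 here. The array-difference form works because Array#- removes every occurrence of each listed word, not just the first.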

Determine the frequency of all words that appear in the text

Your use of a hash with a default value of zero was a good choice:

def frequency_of_all_words(words)
  frequency = Hash.new(0)
  words.each { |word| frequency[word] +=1 }
  frequency
end

I wrote this as a method to emphasize that words.each... does not return frequency. Often you would see this written more compactly using the method Enumerable#each_with_object, which returns the hash ("object"):

def frequency_of_all_words(words)
  words.each_with_object(Hash.new(0)) { |word, h| h[word] +=1 }
end
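For example, with a made-up word list (a sketch only; the input is not from the original question):

```ruby
def frequency_of_all_words(words)
  # each_with_object returns the hash after the last iteration.
  words.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
end

words = %w[candy gram candy cane gram candy]
frequency = frequency_of_all_words(words)

frequency.each { |word, count| puts "#{word} #{count}" }
```

This prints one line per distinct word, in order of first appearance. On Ruby 2.7 or later, words.tally produces the same counts.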

Once you have the hash frequency you can sort it as you did:

frequency.sort_by {|word, freq| freq }

or

frequency.sort_by(&:last)

which you could write:

frequency.sort_by {|_, freq| freq }

since you aren't using the first block variable. If you wanted the most frequent words first:

frequency.sort_by(&:last).reverse

or

frequency.sort_by {|_, freq| -freq }

All of these will give you an array. If you want to convert it back to a hash (with the largest values first, say):

Hash[frequency.sort_by(&:last).reverse]

or in Ruby 2.1+,

frequency.sort_by(&:last).reverse.to_h
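The sorting variants can be tried on a small hand-written hash (made-up counts, chosen to be distinct so the ordering is unambiguous):

```ruby
# Made-up frequencies.
frequency = { 'cane' => 1, 'candy' => 3, 'gram' => 2 }

ascending  = frequency.sort_by { |_, freq| freq }   # least frequent first
descending = frequency.sort_by { |_, freq| -freq }  # most frequent first
as_hash    = descending.to_h                        # Array#to_h keeps the order

descending.each { |word, freq| puts "#{word} #{freq}" }
```

Because Ruby hashes preserve insertion order, as_hash iterates from the most frequent word to the least frequent.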

Count the number of times a substring appears

Now let's count the number of times the string 'candy gram' appears. You might think we could use String#scan on the string holding the entire file, as we did earlier4:

text.scan(/\bcandy gram\b/).size

The first problem is that this won't catch 'candy\ngram'; i.e., when the words are separated by a newline character. We could fix that by changing the regex to /\bcandy\sgram\b/. A second problem is that 'candy gram' might have been 'candy. Gram' in the file, in which case you might not want to count it.
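Both problems show up in the following sketch. The text is made up and deliberately left unstripped, and \s+ is used instead of \s as an assumption, to tolerate several whitespace characters in a row:

```ruby
# Made-up text: one plain match, one split by a newline, one split by ". ".
text = "a candy gram here, a candy\ngram there, and candy. gram again"

space_only     = text.scan(/\bcandy gram\b/).size     # single space only
any_whitespace = text.scan(/\bcandy\s+gram\b/).size   # any run of whitespace

puts space_only
puts any_whitespace
```

The first count is 1 and the second is 2. Neither regex matches "candy. gram", but that pair would be counted if the punctuation had been stripped from the text beforehand.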

A better way is to use the method Enumerable#each_cons on the array words. The easiest way to show you how that works is by example:

words = %w{ check for candy gram here candy gram again }
  #=> ["check", "for", "candy", "gram", "here", "candy", "gram", "again"]
enum = words.each_cons(2)
  #=> #<Enumerator: ["check", "for", "candy", "gram", "here", "candy",
  #                  "gram", "again"]:each_cons(2)>
enum.to_a
  #=> [["check", "for"], ["for",  "candy"], ["candy", "gram"],
  #    ["gram", "here"], ["here", "candy"], ["candy", "gram"],
  #    ["gram", "again"]]

each_cons(2) returns an enumerator; I've converted it to an array to display its contents.

So we can write

words.each_cons(2).map { |word_pair| word_pair.join(' ') }
  #=> ["check for", "for candy", "candy gram", "gram here",
  #    "here candy", "candy gram", "gram again"]

and lastly:

words.each_cons(2)
     .map { |word_pair| word_pair.join(' ') }
     .count { |s| s == 'candy gram' }
  #=> 2
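Putting this together for all four terms in the question, one possible sketch is below. The helper name count_phrases and the sample word list are hypothetical; the idea is simply to use each_cons with a window as long as each phrase.

```ruby
# Count how often each phrase (one or more words) occurs in an array of words.
# count_phrases is a hypothetical helper name, not part of Ruby.
def count_phrases(words, phrases)
  phrases.each_with_object({}) do |phrase, counts|
    target = phrase.split
    # Slide a window of target.size words and count the exact matches.
    counts[phrase] = words.each_cons(target.size).count { |group| group == target }
  end
end

words  = %w[check for candy gram here gram candy gram again]
counts = count_phrases(words, ['candy', 'gram', 'candy gram', 'gram candy'])

counts.each { |phrase, n| puts "#{phrase} #{n}" }
```

For this word list the counts are candy 2, gram 3, candy gram 2 and gram candy 1.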

1 If you also wanted to keep dashes, for hyphenated words, change the regex to /[^-a-z\s]/ or /[^a-z\s-]/ (the \s keeps whitespace so the text can still be split into words).

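2 Note from the String#split documentation that .split is the same as .split(' '). It also gives the same result as .split(/\s+/), except that the latter returns a leading empty string when the text begins with whitespace.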

3 Also, Ruby's naming convention is to use lower-case letters and underscores ("snake_case") for variables and methods, such as my_array.

Upvotes: 0

Dave N

Reputation: 398

It sounds like you're after n-grams. You could break the text into combinations of consecutive words in the first place, and then count the occurrences in the resulting array of word groupings. Here's an example:

whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"

[["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]].each do |term|
  terms = whole_file.split(/\s+/).each_cons(term.length).to_a
  puts "#{term.join(" ")} #{terms.count(term)}"
end

EDIT: As was pointed out in the comments below, I wasn't paying close enough attention and was splitting the file on each loop iteration, which is obviously not a good idea, especially if the file is large. I also hadn't accounted for the fact that the original question may have needed the results sorted by count, although that wasn't explicitly asked.

whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
# This is simplistic. You would need to address punctuation and other characters before
# or at this step.
split_file = whole_file.split(/\s+/)
terms_to_count = [["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]]
counts = []

terms_to_count.each do |term|
  terms = split_file.each_cons(term.length).to_a
  counts << [term.join(" "), terms.count(term)]
end

# Seemed like you may need to do sorting too, so here that is:
sorted = counts.sort { |a, b| b[1] <=> a[1] }
sorted.each do |count|
  puts "#{count[0]} #{count[1]}"
end

Upvotes: 1
