Remove all special char except apostrophe

Given a sentence, I want to count how many times each word occurs. It is an exercise from Exercism.io (Word Count).

For example, for the input "olly olly in come free" the expected output is:

olly: 2
in: 1
come: 1
free: 1

I have this test, for example:

  def test_with_quotations
    phrase = Phrase.new("Joe can't tell between 'large' and large.")
    counts = {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
    assert_equal counts, phrase.word_count
  end

This is my method:

  def word_count
    phrase = @phrase.downcase.split(/\W+/)
    counts = phrase.group_by{|word| word}.map {|k,v| [k, v.count]}
    Hash[*counts.flatten]
  end

When I run the tests in the terminal, I get this failure:

  2) Failure:
PhraseTest#test_with_apostrophes [word_count_test.rb:69]:
--- expected
+++ actual
@@ -1 +1 @@
-{"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
+{"first"=>1, "don"=>2, "t"=>2, "laugh"=>1, "then"=>1, "cry"=>1}

My problem is removing all special characters except the apostrophe.

The regex in the method almost works (phrase = @phrase.downcase.split(/\W+/)), but it also removes the apostrophes.

I don't want to keep single quotes around a word: 'Hello' => Hello, but Don't be cruel => Don't be cruel.

Upvotes: 1

Answers (3)

Sagar Pandya

Reputation: 9508

Another way:

str = "First: don't 'laugh'. Then: 'don't cry'."
reg = /
      [a-z]         #single letter
      [a-z']+       #one or more letters or apostrophe
      [a-z]         #single letter
      '?            #optional single apostrophe

      /ix           #case-insensitive and free-spacing regex

str.scan(reg).group_by(&:itself).transform_values(&:count)
  #=> {"First"=>1, "don't"=>2, "laugh'"=>1, "Then"=>1, "cry'"=>1}
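If the trailing apostrophe is not wanted and you are on Ruby 2.7 or later, Enumerable#tally can do the grouping and counting in one call. A minimal sketch, not the regex from above: it assumes a word is a run of letters or digits with optional apostrophes only between such runs, and the count_words name is made up for illustration.

```ruby
# Hypothetical helper, not part of the answer above.
# Assumes Ruby 2.7+ for Enumerable#tally.
def count_words(str)
  # A word is letters/digits, optionally followed by groups of
  # "apostrophe + letters/digits", so quotes around a word are never matched.
  str.downcase.scan(/[a-z0-9]+(?:'[a-z0-9]+)*/).tally
end

count_words("First: don't 'laugh'. Then: 'don't cry'.")
#=> {"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}
```

Because the pattern has no minimum length, one- and two-letter words such as "in" are counted as well.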

Upvotes: 0

grail

Reputation: 930

Maybe something like:

string.scan(/\b[\w']+\b/i).each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }

The regex uses word boundaries (\b). scan returns an array of the words it finds, and each word is then added to the hash, whose default value of zero is incremented on every occurrence.

It turns out that my solution, while it finds all items regardless of case, leaves them in the case they were originally found in. It is now up to Nelly either to accept that as is, or to downcase the original string or each array item as it is added to the hash.

I'll leave that decision up to you :)
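For what it's worth, here is roughly how the downcase-first option could look inside the Phrase class from the question. This is only a sketch: the initialize method is an assumption, since the question shows only word_count.

```ruby
# Sketch of the asker's Phrase class using the word-boundary scan.
# The constructor is assumed; the question only shows word_count.
class Phrase
  def initialize(phrase)
    @phrase = phrase
  end

  def word_count
    @phrase.downcase                 # normalise case before scanning
           .scan(/\b[\w']+\b/)       # words, keeping inner apostrophes
           .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
  end
end

Phrase.new("Joe can't tell between 'large' and large.").word_count
#=> {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}
```

Downcasing the whole string once is a bit simpler than downcasing each word as it goes into the hash, and the result is the same.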

Upvotes: 4

dawg

Reputation: 104072

Given:

irb(main):015:0> phrase
=> "First: don't laugh. Then: don't cry."

Try:

irb(main):011:0> Hash[phrase.downcase.scan(/[a-z']+/)
                     .group_by{|word| word.downcase}
                     .map{|word, words|[word, words.size]}
                    ]
=> {"first"=>1, "don't"=>2, "laugh"=>1, "then"=>1, "cry"=>1}

With your update, if you want to remove single quotes, do that first:

irb(main):038:0> p2
=> "Joe can't tell between 'large' and large."
irb(main):039:0> p2.gsub(/(?<!\w)'|'(?!\w)/,'')
=> "Joe can't tell between large and large."

Then use the same method.

But, you might say, gsub(/(?<!\w)'|'(?!\w)/,'') will also remove the apostrophe in 'Twas the night before. To which I reply: if /(?<!\w)'|'(?!\w)/ is not sufficient, you will eventually need to build a parser that can tell an apostrophe from a single quote.
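To make that limitation concrete, here is a small sketch that runs the same substitution on both strings. The QUOTE constant name is just for illustration.

```ruby
# Matches an apostrophe that lacks a word character on at least one side.
# QUOTE is a made-up name for this example.
QUOTE = /(?<!\w)'|'(?!\w)/

quoted = "Joe can't tell between 'large' and large.".gsub(QUOTE, '')
#=> "Joe can't tell between large and large."

elided = "'Twas the night before".gsub(QUOTE, '')
#=> "Twas the night before"
```

The apostrophe in can't survives because it has a word character on each side, while the leading one in 'Twas looks exactly like an opening quote to the regex.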

You can also use word boundaries:

irb(main):041:0> Hash[p2.downcase.scan(/\b[a-z']+\b/)
                  .group_by{|word| word.downcase}
                  .map{|word, words|[word, words.size]}
                 ]
=> {"joe"=>1, "can't"=>1, "tell"=>1, "between"=>1, "large"=>2, "and"=>1}

But that does not solve 'Tis the night either.

Upvotes: 1
