sent-hil
sent-hil

Reputation: 19285

Word count in Rails?

Say I have a blog model with Title and Body. How I do show the number of words in Body and characters in Title? I want the output to be something like this

Title: Lorem Body: Lorem Lorem Lorem

This post has word count of 3.

Upvotes: 17

Views: 10947

Answers (6)

Repolês
Repolês

Reputation: 1793

"caçapão adipisicing elit".scan(/[\w-]+/).size 
=> 5

But as we can see, the sentence has only 3 words. The problem is related with the accented characters, because the regex \w doesn't consider them as a word character [A-Za-z0-9_].

An improved solution would be

I18n.transliterate("caçapão adipisicing elit").scan(/[\w-]+/).size
=> 3

Upvotes: 3

Mohamad
Mohamad

Reputation: 35349

The answers here have a couple of issues:

  1. They don't account for utf and unicode chars (diacritics): áâãêü etc...
  2. They don't account for apostrophes and hyphens. So Joe's will be considered two words Joe and 's which is obviously incorrect. As will twenty-two, which is a single compound word.

Something like this works better and account for those issues:

foo.scan(/[\p{Alpha}\-']+/)

You might want to look at my Words Counted gem. It allows to count words, their occurrences, lengths, and a couple of other things. It's also very well documented.

counter = WordsCounted::Counter.new(post.body)
counter.word_count #=> 3
counter.most_occuring_words #=> [["lorem", 3]]
# This also takes into capitalisation into account.
# So `Hello` and `hello` are counted as the same word.

Upvotes: 3

schreifels
schreifels

Reputation: 431

If you're interested in performance, I wrote a quick benchmark:

require 'benchmark'
require 'bigdecimal/math'
require 'active_support/core_ext/string/filters'

# Where "shakespeare" is the full text of The Complete Works of William Shakespeare...

puts 'Benchmarking shakespeare.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.squish.scan(/\w+/).size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.scan(/\w+/).size } }
puts 'Benchmarking shakespeare.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.split.size } }
puts 'Benchmarking shakespeare.squish.split.size x50'
puts Benchmark.measure { 50.times { shakespeare.squish.split.size } }

The results:

Benchmarking shakespeare.scan(/\w+/).size x50
 13.980000   0.240000  14.220000 ( 14.234612)
Benchmarking shakespeare.squish.scan(/\w+/).size x50
 40.850000   0.270000  41.120000 ( 41.109643)
Benchmarking shakespeare.split.size x50
  5.820000   0.210000   6.030000 (  6.028998)
Benchmarking shakespeare.squish.split.size x50
 31.000000   0.260000  31.260000 ( 31.268706)

In other words, squish is slow with Very Large Strings™. Other than that, split is faster (twice as fast if you're not using squish).

Upvotes: 5

YOU
YOU

Reputation: 123821

"Lorem Lorem Lorem".scan(/\w+/).size
=> 3

UPDATE: if you need to match rock-and-roll as one word, you could do like

"Lorem Lorem Lorem rock-and-roll".scan(/[\w-]+/).size
=> 4

Upvotes: 37

IDBD
IDBD

Reputation: 298

"Lorem Lorem Lorem".scan(/\S+/).size
=> 3

Upvotes: 2

P4010
P4010

Reputation: 201

Also:

"Lorem Lorem Lorem".split.size
=> 3

Upvotes: 20

Related Questions