Perfect_Comment

Reputation: 195

How to get a count of the word sizes in a large amount of text?

I have a large amount of text, roughly 7000 words.

I would like to get a count of the word sizes, e.g. the count of 4-letter words and 6-letter words, using regex.

I am unsure how to go about this. My thought process so far is to split the text into a String array, which would allow me to count each individual element's size. Is there an easier way to go about this using a regex? I am using Groovy for this task.

EDIT: I did get this working using a normal array, but it was slightly messy. For anyone who comes across a similar problem: the final solution simply used Groovy's countBy() method coupled with a small amount of logic.
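
The Groovy code from the edit above was not posted. For readers working in plain Java, a rough equivalent of the countBy() idea might look like the sketch below. The names WordSizeCounts and countBySize are made up for illustration, and splitting on \W+ is an assumption about what counts as a word.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not the asker's actual code.
class WordSizeCounts {
    // Maps each word length to the number of words of that length.
    static Map<Integer, Long> countBySize(String text) {
        return Arrays.stream(text.split("\\W+"))   // assumption: words are runs of \w characters
                .filter(w -> !w.isEmpty())          // a leading separator produces one empty token
                .collect(Collectors.groupingBy(String::length, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {3=4, 4=2, 5=3}
        System.out.println(countBySize("The quick brown fox jumps over the lazy dog"));
    }
}
```

Grouping by length gives every word size in a single pass, so no separate regex per size is needed.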

Upvotes: 0

Views: 144

Answers (3)

revo

Reputation: 48751

Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, the pattern also matches inside words longer than n characters. For a four-character word use \b\w{4}\b; for a six-character word use \b\w{6}\b. Here is a demo with 7000 words as input string.

Java implementation:

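import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dummy = ".....";  // the input text
// \b on both sides restricts matches to words of exactly six characters
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");
Matcher matcher = pattern.matcher(dummy);

int count = 0;
while (matcher.find()) {
    count++;
}

System.out.println(count);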
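
To see why the boundaries matter, here is a small sketch that runs the same pattern with and without \b. The class name BoundaryDemo, the helper countMatches, and the sample string are all hypothetical, chosen only to illustrate the point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class BoundaryDemo {
    // Counts non-overlapping matches of regex in text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "abcdefgh word";
        // Without boundaries, "abcd" and "efgh" inside the long word also match.
        System.out.println(countMatches("\\w{4}", text));       // prints 3
        // With boundaries, only the real four-character word matches.
        System.out.println(countMatches("\\b\\w{4}\\b", text)); // prints 1
    }
}
```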

Upvotes: 2

Whothehellisthat

Reputation: 2152

You could generate regexes for each size you want.

  • \b\w{6}\b would get each word with 6 letters exactly
  • \b\w{7}\b would get each word with 7 letters exactly
  • and so on...

So you could run one of these regexes on the text with the global flag enabled (so that it finds every instance in the whole string). This gives you an array of every match, and the length of that array is the count for that word size.
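
As a sketch of the one-regex-per-size idea in Java: Java has no global flag, so the loop over Matcher.find() plays that role here. The names PerSizeRegex and countPerSize, and the maxSize parameter, are assumptions made for this example, and each pattern is wrapped in \b so that longer words are not counted.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PerSizeRegex {
    // Builds one pattern per word size from 1 to maxSize and counts its matches.
    static Map<Integer, Integer> countPerSize(String text, int maxSize) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxSize; n++) {
            Matcher matcher = Pattern.compile("\\b\\w{" + n + "}\\b").matcher(text);
            int count = 0;
            while (matcher.find()) {   // repeated find() stands in for a global flag
                count++;
            }
            if (count > 0) {
                counts.put(n, count);  // sizes with no matches are left out
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {3=3, 4=2, 5=2}
        System.out.println(countPerSize("one two three four five six seven", 10));
    }
}
```

Note that this scans the text once per size, which is fine for about 7000 words but does more work than a single pass.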

Upvotes: 0

xenteros

Reputation: 15852

Read the file word by word using any stream and calculate the length of each word. Store the counters in an array indexed by word length, and increment the matching counter after reading each word.
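
A minimal sketch of that approach, assuming a java.util.Scanner as the stream and a fixed upper bound on word length. The names StreamWordLengths, countLengths, and maxLength are hypothetical, and stripping non-word characters from each token is an added assumption so that punctuation is not counted.

```java
import java.util.Scanner;

// Hypothetical helper, for illustration only.
class StreamWordLengths {
    // Returns an array where index i holds the number of words of length i.
    static int[] countLengths(Scanner in, int maxLength) {
        int[] counts = new int[maxLength + 1];
        while (in.hasNext()) {
            // Scanner splits on whitespace; drop punctuation attached to the token.
            String word = in.next().replaceAll("\\W", "");
            int length = word.length();
            if (length > 0 && length <= maxLength) {
                counts[length]++;  // words longer than maxLength are ignored
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A String stands in for the file here; a Scanner over a file is read the same way.
        int[] counts = countLengths(new Scanner("Hello, world! It is a test."), 20);
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > 0) {
                System.out.println(i + " letters: " + counts[i]);
            }
        }
    }
}
```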

Upvotes: 0
