Reputation: 195
I have a large amount of text, roughly 7000 words.
I would like to get a count of the word sizes, e.g. the number of 4-letter words and the number of 6-letter words, using regex.
I am unsure how to go about this. My thought process so far is to split the text into a String array, which would allow me to check each individual element's size. Is there an easier way to go about this using a regex? I am using Groovy for this task.
EDIT: I did get this working using a normal array, but it was slightly messy. The final solution simply used Groovy's countBy() method coupled with a small amount of logic, for anyone who might come across a similar problem.
Upvotes: 0
Views: 144
Reputation: 48751
Don't forget the word boundary token \b. If you don't put it at both ends of a \w{n} token, then all words longer than n characters are also found. For a 4-character word use \b\w{4}\b; for a six-character word use \b\w{6}\b. Here is a demo with 7000 words as the input string.
Java implementation:
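import java.util.regex.Matcher;
import java.util.regex.Pattern;

String dummy = "....."; // the input text
// \b on both sides, so only words of exactly six characters match
Pattern pattern = Pattern.compile("\\b\\w{6}\\b");
Matcher matcher = pattern.matcher(dummy);
int count = 0;
// each successful find() is one more six-character word
while (matcher.find()) {
    count++;
}
System.out.println(count);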
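If every word size is needed rather than just one, a single pass can tally them all. The following is only a sketch under that assumption: the class name `WordLengthHistogram` and the method `countByLength` are made up for illustration, and "word" here means any run of `\w` characters (letters, digits, underscore).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class WordLengthHistogram {
    // \w+ is greedy, so each match is a whole run of word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /** Returns a map from word length to the number of words of that length. */
    static Map<Integer, Integer> countByLength(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            int length = matcher.end() - matcher.start();
            // Add 1 to the counter for this length, starting from 1 if absent.
            counts.merge(length, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "the quick brown fox jumps over the lazy dog";
        System.out.println(countByLength(sample)); // prints {3=4, 4=2, 5=3}
    }
}
```

A sorted map keeps the output ordered by word length, which is convenient for printing but not required.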
Upvotes: 2
Reputation: 2152
You could generate a regex for each size you want.
\b\w{6}\b would match each word with exactly 6 letters.
\b\w{7}\b would match each word with exactly 7 letters.
The \b on both ends is needed; without it, \w{6} also matches inside longer words. So you could run one of these regexes on the text with the global flag enabled (finding every instance in the whole string). This will give you an array of every match, which you can then find the length of.
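As a rough illustration of generating one pattern per size, here is a Java sketch. Java has no global flag, so the matches are collected with `Matcher.results()` (Java 9 or later) instead. The class name `ExactLengthCounter` and the method `countWordsOfLength` are hypothetical, and `n` is assumed to be positive.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ExactLengthCounter {
    /** Counts words of exactly n word characters; n is assumed to be positive. */
    static long countWordsOfLength(String text, int n) {
        // \b on both sides keeps \w{n} from matching inside longer words.
        Pattern pattern = Pattern.compile("\\b\\w{" + n + "}\\b");
        // results() streams every match in the text (requires Java 9+).
        return pattern.matcher(text).results().count();
    }

    public static void main(String[] args) {
        String sample = "regex counts short words and longer phrases";
        for (int n = 4; n <= 7; n++) {
            System.out.println(n + " letters: " + countWordsOfLength(sample, n));
        }
    }
}
```

This scans the text once per size, so for many sizes a single pass that tallies all lengths is likely cheaper.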
Upvotes: 0
Reputation: 15852
Read the file word by word using any stream and compute each word's length. Store the counters in an array indexed by length, and increment the matching counter after reading each word.
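A minimal Java sketch of this idea, assuming a `java.util.Scanner` as the stream: the class name `StreamingLengthCounter`, the method `countLengths` and the `maxLength` cap are all made up for illustration, and words longer than the cap are simply skipped.

```java
import java.util.Scanner;

// Hypothetical helper, not part of the original answer.
class StreamingLengthCounter {
    /** Tallies whitespace-separated words by length; counters[n] holds the count for length n. */
    static int[] countLengths(Scanner scanner, int maxLength) {
        int[] counters = new int[maxLength + 1];
        while (scanner.hasNext()) {
            // Strip non-word characters so that "dog." counts as three letters.
            String word = scanner.next().replaceAll("\\W", "");
            // Assumption: empty tokens and words longer than maxLength are ignored.
            if (!word.isEmpty() && word.length() <= maxLength) {
                counters[word.length()]++;
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for the file here.
        try (Scanner scanner = new Scanner("The lazy dog. The quick fox!")) {
            int[] counters = countLengths(scanner, 20);
            for (int n = 1; n < counters.length; n++) {
                if (counters[n] > 0) {
                    System.out.println(n + " letters: " + counters[n]);
                }
            }
        }
    }
}
```

The same method works on a file if the Scanner is built from a file path instead of a String.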
Upvotes: 0