Reputation: 1
My task is to get the word frequency of this file:
test_words_file-1.txt:
The quick brown fox
Hopefully245this---is a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the
I've been trying to remove symbols and digits from this file and get each word's frequency in alphabetical order, and the result is:
I can see that the digits have been removed, but they are still being counted. Can you explain why, and how I can fix this?
Also, how can I separate "Hopefully245this---is" and store 3 useful words "hopefully", "this", "is"?
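import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;
public class WordFreq2 {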
public static void main(String[] args) throws FileNotFoundException {
File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
Scanner scanner = new Scanner(file);
int maxWordLen = 0;
String maxWord = null;
HashMap<String, Integer> map = new HashMap<>();
while(scanner.hasNext()) {
String word = scanner.next();
word = word.toLowerCase();
// text cleaning
word = word.replaceAll("[^a-zA-Z]+", "");
if(map.containsKey(word)) {
//if the word already exists
int count = map.get(word)+1;
map.put(word,count);
}
else {
// The word is new
int count = 1;
map.put(word, count);
// Find the max length of Word
if (word.length() > maxWordLen) {
maxWordLen = word.length();
maxWord = word;
}
}
}
scanner.close();
//HashMap unsorted, sort
TreeMap<String, Integer> sorted = new TreeMap<>();
sorted.putAll(map);
for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
System.out.println(entry);
}
System.out.println(maxWordLen+" ("+maxWord+")");
}
}
Upvotes: 0
Views: 3957
Reputation: 11
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
public class test
{
public static void main(String[] args) throws FileNotFoundException
{
File f = new File("C:\\Users\\Nandini\\Downloads\\CountFreq.txt");
Scanner s = new Scanner(f);
Map<String, Integer> counts = new HashMap<String, Integer>();
while( s.hasNext() )
{
String word = s.next();
word = word.toLowerCase();
if( !counts.containsKey( word ) )
counts.put( word, 1 );
else
counts.put( word, counts.get(word) + 1 );
}
s.close(); // release the file handle once all tokens are read
System.out.println(counts);
}
}
Output: {the=1, this=3, have=1, is=2, word=1}
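The loop above counts whitespace-separated tokens exactly as they appear, so digits and punctuation stay attached to the words. One possible adjustment, shown here only as a sketch, is to tell the Scanner to treat every run of non-letters as a delimiter. The class name ScannerWordCount and the helper count are made up for illustration, and the text is an in-memory string holding two lines from the question's file rather than the file itself.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical class name, used only for this sketch.
class ScannerWordCount {

    // Counts letter-only words; every run of non-letters acts as a separator.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        try (Scanner s = new Scanner(text)) {
            s.useDelimiter("[^A-Za-z]+");
            while (s.hasNext()) {
                String word = s.next().toLowerCase();
                if (!word.isEmpty()) { // defensive: never count an empty token
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two lines copied from the question's test file.
        String text = "Hopefully245this---is a quick13947\n..quicK.\n";
        System.out.println(count(text));
    }
}
```

To read the real file, the same delimiter can be set on a Scanner created from a File, as in the code above.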
Upvotes: 0
Reputation: 1666
On Java 9 or newer, Matcher#results can be used in a stream solution like this:
Pattern pattern = Pattern.compile("[a-zA-Z]+");
try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
br.lines()
.map(pattern::matcher)
.flatMap(Matcher::results)
.map(matchResult -> matchResult.group(0))
.collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
.forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
} catch (IOException e) {
System.err.format("IOException: %s%n", e);
}
Output:
a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
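Where Matcher#results is not available (Java 8), a similar pipeline can probably be built on Pattern#splitAsStream, which splits each line on the non-letter runs instead of matching the letter runs. This is a sketch under that assumption, not a drop-in replacement. The class name SplitStreamWordCount and the helper count are invented for the example, and the lines are passed in as a list instead of being read from the file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class SplitStreamWordCount {

    // Everything that is not a letter separates two words.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(NON_LETTERS::splitAsStream)
                // A line that starts with a separator can yield an empty first piece.
                .filter(token -> !token.isEmpty())
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The quick brown fox", "..quicK.", "the the");
        count(lines).forEach((word, n) -> System.out.printf("%s=%s%n", word, n));
    }
}
```

The filter step matters here: splitting can produce an empty piece, which would otherwise be counted as a word.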
Upvotes: 0
Reputation: 20924
First the code; the explanation follows it.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WordFreq2 {
public static void main(String[] args) {
Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
try {
String text = Files.readString(path); // requires Java 11+; throws java.io.IOException
text = text.toLowerCase();
Pattern pttrn = Pattern.compile("[a-z]+");
Matcher mtchr = pttrn.matcher(text);
TreeMap<String, Integer> freq = new TreeMap<>();
int longest = 0;
while (mtchr.find()) {
String word = mtchr.group();
int letters = word.length();
if (letters > longest) {
longest = letters;
}
if (freq.containsKey(word)) {
freq.computeIfPresent(word, (w, c) -> Integer.valueOf(c.intValue() + 1));
}
else {
freq.computeIfAbsent(word, (w) -> Integer.valueOf(1));
}
}
String format = "%-" + longest + "s = %2d%n";
freq.forEach((k, v) -> System.out.printf(format, k, v));
System.out.println("Longest = " + longest);
}
catch (IOException xIo) {
xIo.printStackTrace();
}
}
}
Since your sample file is small, I load the entire file contents into a String.
Then I convert the entire String to lower-case since your definition of a word is a series of consecutive alphabetic, case-insensitive characters.
The regular expression [a-z]+ searches for one or more consecutive, lower-case, alphabetic characters. (Remember the entire String is now all lower-case.)
Each successive call to method find() will find the next word in the String (according to the above definition of a word, i.e. a consecutive series of lower-case letters of the alphabet).
To count the word frequencies, I use a TreeMap where the map key is the word and the map value is the number of times that word appears in the String. Note that map keys and values cannot be primitives, hence the value is Integer and not int.
If the last word found already appears in the map, I increment the count.
If the last word found does not appear in the map, it is added to the map and its count is set to 1 (one).
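The two cases just described can also be collapsed into a single call to Map#merge, which stores the given value for a new key and otherwise combines it with the existing value. A minimal sketch of that alternative follows. The class name MergeCount and the helper count are illustrative only, and the words are supplied as an array instead of coming from the matcher.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name, used only for this sketch.
class MergeCount {

    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : words) {
            // Absent key: store 1. Present key: replace the value with old + 1.
            freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count(new String[] {"the", "quick", "the", "fox"}));
        // prints {fox=1, quick=1, the=2}
    }
}
```

The behaviour is the same as the if/else in the code above; merge only removes the separate containsKey check.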
Along with adding the words to the map, I count the letters of each word found in order to find the longest word.
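The code above keeps only the length of the longest word, whereas the code in the question also kept the word itself. If that is wanted, one way to do it, sketched here with a made-up class LongestWord and helper longest, is to remember the word and derive the length from it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class LongestWord {

    // Returns the first word of maximal length, or "" if the text has no letters.
    static String longest(String text) {
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        String longest = "";
        while (m.find()) {
            String word = m.group();
            if (word.length() > longest.length()) { // strict '>' keeps the earliest longest word
                longest = word;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        String word = longest("Hopefully245this---is a quick13947");
        System.out.println(word.length() + " (" + word + ")"); // prints 9 (hopefully)
    }
}
```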
After the entire String is processed, I print the contents of the map, one entry per line, and finally print the number of letters in the longest word found. Note that TreeMap sorts its keys, hence the list of words appears in alphabetical order.
Here is the output:
a = 1
be = 1
brown = 1
but = 1
complete = 1
for = 1
fox = 1
hopefully = 1
is = 1
less = 1
maybe = 1
quick = 3
task = 2
the = 12
this = 1
to = 1
will = 1
you = 1
Longest = 9
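The format string built in the code, "%-" + longest + "s = %2d%n", left-justifies each word in a column as wide as the longest word and right-justifies the count in two columns. A small sketch of that formatting in isolation follows. The class AlignedOutput and the helper line are hypothetical names, and the trailing %n is left out so the result is easier to inspect.

```java
// Hypothetical class name, used only for this sketch.
class AlignedOutput {

    // Pads the word on the right to 'width' characters and prints the count in 2 columns.
    static String line(int width, String word, int count) {
        String format = "%-" + width + "s = %2d";
        return String.format(format, word, count);
    }

    public static void main(String[] args) {
        System.out.println(line(9, "hopefully", 1));
        System.out.println(line(9, "the", 12));
    }
}
```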
Upvotes: 2
Reputation: 79620
And how can I separate "Hopefully245this---is" and store 3 useful words "hopefully", "this", "is"?
Use the regex API for this requirement.
Demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String str = "Hopefully245this---is";
Pattern pattern = Pattern.compile("[A-Za-z]+");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
Output:
Hopefully
this
is
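Matching the words also sidesteps the other problem in the question. There, replaceAll deletes characters inside each whitespace-separated token, so a token made only of digits turns into the empty string and is then stored in the map as a key, while a token like the one above collapses into a single run of letters. A short sketch that reproduces both effects with the question's own regex (the class EmptyTokenDemo and the helper clean are names invented for this example):

```java
// Hypothetical class name, used only for this sketch.
class EmptyTokenDemo {

    // Same cleaning steps as in the question: lower-case, then strip non-letters.
    static String clean(String token) {
        return token.toLowerCase().replaceAll("[^a-zA-Z]+", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("098234") + "]");                // prints []
        System.out.println("[" + clean("Hopefully245this---is") + "]"); // prints [hopefullythisis]
    }
}
```

Skipping empty results would hide the first symptom, but only the matcher loop also separates the three words.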
Upvotes: 1