Reputation: 1

How to find word frequency in a text file?

My task is to get the word frequency of this file:

test_words_file-1.txt:

The quick brown fox
Hopefully245this---is   a quick13947
task&&#%*for you to complete.
But maybe the tASk 098234 will be less
..quicK.
the the the the the the the the the the

I've been trying to remove symbols and digits from this file and get each word's frequency in alphabetical order, and the result is:

I can see that the digits have been removed, but they are still being counted. Can you explain why, and how I can fix this?

Also, how can I separate "Hopefully245this---is" and store the 3 useful words "hopefully", "this", "is"?

public class WordFreq2 {
    public static void main(String[] args) throws FileNotFoundException {

        File file = new File("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        Scanner scanner = new Scanner(file); 
        int maxWordLen = 0; 
        String maxWord = null;

        HashMap<String, Integer> map = new HashMap<>();
        while(scanner.hasNext()) {
            String word = scanner.next();
            word = word.toLowerCase();
            // text cleaning 
            word = word.replaceAll("[^a-zA-Z]+", "");

            if(map.containsKey(word)) {
                //if the word already exists
                int count = map.get(word)+1;
                map.put(word,count);
            }
            else {
                // The word is new 
                int count = 1;
                map.put(word, count);

                // Find the max length of Word
                if (word.length() > maxWordLen) {
                    maxWordLen = word.length();
                    maxWord = word;
                }
            }   
        }

        scanner.close();

        //HashMap unsorted, sort 
        TreeMap<String, Integer> sorted = new TreeMap<>();
        sorted.putAll(map);


        for (Map.Entry<String, Integer> entry: sorted.entrySet()) {
            System.out.println(entry);
        }

        System.out.println(maxWordLen+" ("+maxWord+")");
    }

}

Upvotes: 0

Answers (4)

Durgesh Nandini

Reputation: 11

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
 
public class test
{
  public static void main(String[] args) throws FileNotFoundException
  {
    File f = new File("C:\\Users\\Nandini\\Downloads\\CountFreq.txt");
    Scanner s = new Scanner(f);
    Map<String, Integer> counts = new HashMap<String, Integer>(); 
    while (s.hasNext())
    {
      String word = s.next();
      word = word.toLowerCase();
      if (!counts.containsKey(word))
        counts.put(word, 1);
      else
        counts.put(word, counts.get(word) + 1);
    }
    System.out.println(counts);
  }
  
}

Output: {the=1, this=3, have=1, is=2, word=1}

Upvotes: 0

MartinBG

Reputation: 1666

On Java 9 or newer, Matcher#results can be used in a stream solution like this:

    Pattern pattern = Pattern.compile("[a-zA-Z]+");
    try (BufferedReader br = Files.newBufferedReader(Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt"))) {
        br.lines()
                .map(pattern::matcher)
                .flatMap(Matcher::results)
                .map(matchResult -> matchResult.group(0))
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()))
                .forEach((word, count) -> System.out.printf("%s=%s%n", word, count));
    } catch (IOException e) {
        System.err.format("IOException: %s%n", e);
    }

Output:

a=1
be=1
brown=1
but=1
complete=1
for=1
fox=1
hopefully=1
is=1
less=1
maybe=1
quick=3
task=2
the=12
this=1
to=1
will=1
you=1
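
To try the same pipeline without touching the file system, here is a minimal self-contained sketch. The class name `StreamCount`, the method `count`, and the in-memory sample text are assumptions made for illustration; only `Matcher#results` (Java 9+) and the collector come from the snippet above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class StreamCount {

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Groups every run of letters by its lower-case form and counts the matches.
    static Map<String, Long> count(String text) {
        return WORD.matcher(text)
                .results()                       // Stream<MatchResult>, Java 9+
                .map(MatchResult::group)
                .collect(Collectors.groupingBy(String::toLowerCase, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample text is an assumption: a few fragments of the question's file.
        String sample = "The quick brown fox\n..quicK.\nthe the";
        count(sample).forEach((word, n) -> System.out.printf("%s=%d%n", word, n));
    }
}
```

Because the matcher runs over the whole string, line breaks need no special handling here; they are simply non-letters between matches.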

Upvotes: 0

Abra

Reputation: 20924

First the code; the explanation follows it.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFreq2 {

    public static void main(String[] args) {
        Path path = Paths.get("C:\\Users\\Jason\\Downloads\\test_words_file-1.txt");
        try {
            String text = Files.readString(path); // throws java.io.IOException
            text = text.toLowerCase();
            Pattern pttrn = Pattern.compile("[a-z]+");
            Matcher mtchr = pttrn.matcher(text);
            TreeMap<String, Integer> freq = new TreeMap<>();
            int longest = 0;
            while (mtchr.find()) {
                String word = mtchr.group();
                int letters = word.length();
                if (letters > longest) {
                    longest = letters;
                }
                if (freq.containsKey(word)) { 
                    freq.computeIfPresent(word, (w, c) -> Integer.valueOf(c.intValue() + 1));
                }
                else {
                    freq.computeIfAbsent(word, (w) -> Integer.valueOf(1));
                }
            }
            String format = "%-" + longest + "s = %2d%n";
            freq.forEach((k, v) -> System.out.printf(format, k, v));
            System.out.println("Longest = " + longest);
        }
        catch (IOException xIo) {
            xIo.printStackTrace();
        }
    }
}

Since your sample file is small, I load the entire file contents into a String.

Then I convert the entire String to lower-case since your definition of a word is a series of consecutive alphabetic, case-insensitive characters.

The regular expression – [a-z]+ – searches for one or more consecutive, lower-case, alphabetic characters. (Remember the entire String is now all lower-case.)

Each successive call to method find() will find the next word in the String (according to the above definition of a word, i.e. a consecutive series of lower-case letters of the alphabet).

To count the word frequencies, I use a TreeMap where the map key is the word and the map value is the number of times that word appears in the String. Note that map keys and values cannot be primitives, hence the value is Integer and not int.

If the last word found already appears in the map, I increment the count.

If the last word found does not appear in the map, it is added to the map and its count is set to 1 (one).
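
The two cases above (increment if present, insert 1 if absent) can also be expressed as a single `Map.merge` call. The following is a minimal sketch of that alternative, not the code used in this answer; the class name `MergeCount` and the sample string are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class MergeCount {

    // Counts each run of letters in the text, ignoring case.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        Matcher m = Pattern.compile("[a-z]+").matcher(text.toLowerCase());
        while (m.find()) {
            // Stores 1 for a new word, otherwise adds 1 to the existing count.
            freq.merge(m.group(), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Sample string is an assumption built from fragments of the question's file.
        System.out.println(count("..quicK. the tASk 098234 the")); // prints {quick=1, task=1, the=2}
    }
}
```

`merge` performs the lookup and the update in one step, so the separate `containsKey` check is no longer needed.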

Along with adding the words to the map, I count the letters of each word found in order to find the longest word.

After the entire String is processed, I print the contents of the map, one entry per line, and finally print the number of letters in the longest word found. Note that TreeMap sorts its keys, hence the list of words appears in alphabetical order.

Here is the output:

a         =  1
be        =  1
brown     =  1
but       =  1
complete  =  1
for       =  1
fox       =  1
hopefully =  1
is        =  1
less      =  1
maybe     =  1
quick     =  3
task      =  2
the       = 12
this      =  1
to        =  1
will      =  1
you       =  1
Longest = 9

Upvotes: 2

Arvind Kumar Avinash

Reputation: 79620

And how can I separate "Hopefully245this---is" and store 3 useful words "hopefully", "this", "is"?

Use the regex API for this requirement.

Demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "Hopefully245this---is";
        Pattern pattern = Pattern.compile("[A-Za-z]+");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

Hopefully
this
is
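
As for why the digits still seem to be counted: `Scanner.next()` splits on whitespace by default, so `098234` arrives as a token of its own, and `replaceAll("[^a-zA-Z]+", "")` reduces it to an empty string, which the question's code then stores as a map key. For the same reason `Hopefully245this---is` collapses into the single key `hopefullythisis`. The sketch below reproduces that cleaning step and shows one possible alternative, splitting each token on runs of non-letters and skipping empty pieces. The class name `TokenDemo` and both method names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class TokenDemo {

    // Mirrors the question's cleaning step: delete every non-letter from one token.
    static String strip(String token) {
        return token.toLowerCase().replaceAll("[^a-z]+", "");
    }

    // Alternative: split on runs of non-letters and skip empty pieces.
    static List<String> words(String token) {
        List<String> out = new ArrayList<>();
        for (String piece : token.toLowerCase().split("[^a-z]+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("[" + strip("098234") + "]");     // prints []  (an empty key)
        System.out.println(strip("Hopefully245this---is"));  // prints hopefullythisis
        System.out.println(words("Hopefully245this---is"));  // prints [hopefully, this, is]
        System.out.println(words("098234"));                 // prints []  (nothing to count)
    }
}
```

With the question's loop, counting each element of `words(scanner.next())` instead of the single stripped token would avoid both the empty key and the merged words.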


Upvotes: 1
