M.Ahmed

Reputation: 11

Counting Occurrences of a word from a String

I want to count how many times each word is repeated in a given file, but I am having trouble doing this. I tried two different ways. In the first, I used a HashMap with the word as the key and its frequency as the associated value. However, this doesn't seem to work, since with a HashMap you can't access elements at a specified index. Now I am trying to use two separate ArrayLists, one for the words and one for the number of occurrences of each word. My thinking was this: while adding words to the wordsCount ArrayList, if a word is already in wordsCount, increment the value of the element in the cnt ArrayList at the index of the already seen word. However, I am not sure what to write to increment the values.

import java.io.*;
import java.lang.reflect.Array;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class MP0 {
    Random generator;
    String delimiters = " \t,;.?!-:@[](){}_*/";
    String[] stopWordsArray = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
            "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
            "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
            "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
            "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
            "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
            "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
            "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
            "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
            "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"};
    private static String str;
    private static File file;
    private static Scanner s;   

    public MP0() {
    }

    public void process() throws Exception{
        ArrayList<Integer> cnt = new ArrayList<Integer>();
        boolean isStopWord = false;
        StringTokenizer st = new StringTokenizer(s.nextLine(), delimiters);
        ArrayList<String> wordsCount = new ArrayList<String>();

        while(st.hasMoreTokens()) {
            String s = st.nextToken().toLowerCase();
            if(!wordsCount.contains(s)) {
                for(int i = 0; i < stopWordsArray.length; i++) {
                    isStopWord = false;
                    if(s.equals(stopWordsArray[i])) {
                        isStopWord = true;
                        break;
                    }
                }
                if(isStopWord == false) {
                    wordsCount.add(s);
                    cnt.add(1);
                }
            }
            else { // i tried this but only displayed "1" for all words
                cnt.set(wordsCount.indexOf(s), cnt.get(wordsCount.indexOf(s) + 1));
            }
        }


        for(int i = 0; i < wordsCount.size(); i++) {
            System.out.println(wordsCount.get(i) + " " + cnt.get(i));
        }

    }

    public static void main(String args[]) throws Exception {
            try {
                file = new File("input.txt");
                s = new Scanner(file);
                str = s.nextLine();
                String[] topItems;
                MP0 mp = new MP0();
                while(s.hasNext()) {
                    mp.process();
                    str = s.nextLine();
                }
            }
            catch(FileNotFoundException e) {
                System.out.println("File not found");
            }
    }

}

Upvotes: 0

Views: 799

Answers (3)

Tim M.

Reputation: 608

I think a Map is definitely the way to go to represent counts per word. In my opinion, the best way (or at least, a different way that hasn't been mentioned yet) to get the Map is by putting the words through a Stream. That way, you can leverage a large amount of code that's already been written in the Java Standard Library, keeping your code more concise and avoiding the need to reinvent the wheel. Streams can have a bit of a learning curve, but once you understand them, they can be incredibly useful. For instance, observe your 20+ line method reduced down to 2 lines:

import java.util.Map;
import java.util.ArrayList;
import java.util.Arrays;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.summingInt;
import static java.util.function.Function.identity;

public class CountWords
{
    private static String delimiters = "[ \t,;.?!\\-:@\\[\\](){}_*/]+";
    private static ArrayList<String> stopWords =    new ArrayList<>(Arrays.asList(new String[] {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours",
                                                "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its",
                                                "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that",
                                                "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having",
                                                "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
                                                "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
                                                "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
                                                "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
                                                "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
                                                "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"}));
    public static void main(String[] args) throws IOException //Your code should likely catch this
    {
        Path fLoc = Paths.get("test.txt"); //Or get from stdio, args[0], etc...
        CountWords cw = new CountWords();
        Map<String, Integer> counts = cw.count(Files.lines(fLoc).flatMap(s -> Arrays.stream(s.split(delimiters))));
        counts.forEach((k, v) -> System.out.format("Key: %s, Val: %d\n", k, v));
    }

    public Map<String, Integer> count(Stream<String> words)
    {
        return words.filter(s -> !stopWords.contains(s))
                    .collect(groupingBy(identity(), summingInt(s -> 1)));
    }
}

It's quite easy to look this all up in the API, but here are the bits that may be less than self-explanatory:

  • Files.lines: A nifty little method that will take a path to a file, and return a Stream of all of the lines in the file. We actually want a stream of words, though, which brings us to the next operation.
  • .flatMap: a mapping operation, in general, takes each item of a set, and converts it to something else. Streams have a method for that, called map, which will take each item and convert it to exactly one other item. In our case, however, we want to convert our lines to words, and each line likely contains many words, so map won't work. Enter flatMap: a mapping operation followed by a flattening operation. A flattening operation, in general, takes each element of a set, and if the child element is a set itself, expands the set so that the parent no longer contains the child set, but rather has all of the child's children as its own children. If that sounded confusing, listen to someone explain it better than I could here. In Java's case, that means that our mapping operation must return a Stream, and the flattening will be taken care of by the flatMap method.
  • Hold on, what's all this -> business? Glad you asked. See, flatMap is a higher-order function - that is, it is a function that takes another function as its argument. We could write the function as a method somewhere (to avoid confusing the terms, because they are very similar: a method is a function attached to an object or class), but this particular function has no logical basis for being attached to any particular object, and what's more, we don't care to reuse it, so it doesn't even need a name. It would be much easier to just specify the function inline. Enter lambda expressions! This question isn't about them, though, so read the link to learn more.
  • Our lambda takes every String, and splits it along your specified delimiters (I converted your delim string to a regular expression). This returns an array, but we need a Stream, so we use the convenience method Arrays.stream for easy conversion. Now each line will be made into a stream of words, and flatMap will handle flattening out the separate lines into a singular stream of all the words in the file. Although I came up with this myself, an almost identical line is used in the common usage examples of the API.
  • .filter: Another higher-order function. This one removes all entries from the stream that don't cause the given function to return true. In your sample code in the question, you refrain from counting all stop words from the array, and so here I use filter to do the same, with help of the rather convenient (and self-explanatory) List.contains (it requires boxing the array you were using inside a List, but I believe it is worth it for the concision you gain). Thus, we have a stream keeping only words that aren't stop words.
  • .collect, groupingBy, etc.: At last, the good stuff. This one short line essentially does all of the work that your question asks for. collect is a method that makes a Stream back into a single object, typically a collection object like a list or map, thus the name. As its argument, it can take a Collector, an object that knows how to collect the given Stream into the desired object. We could build our own, but in this case it is unnecessary; once again, the Standard Library has done the work for us. We use the existing collector groupingBy. In its most basic form, groupingBy takes a single argument (a function; again, we have a higher-order function), called a classifier, which sorts the items into groups. For this argument, we provide Function.identity() (statically imported to match the collectors, which are in turn statically imported to match the style with which they're used in examples in the API). This function simply takes the argument and echoes it back out, for cases when you need a function argument but don't actually want to modify input (it's an alternative to the equivalent but ugly x -> x lambda). We want to do this because the return values of this function form the keys to the map we are collecting, and the collector will automatically group together all return values that are equal according to .equals under a common key (and all of our repeated words will be .equals to each other).
  • By default, this will leave us with a map that has as keys the words themselves, and as values, lists containing each individual instance of the given word. We don't want this, but luckily, there is a groupingBy overload that gives us a second argument to specify: a downstream collector that reduces the grouped items for each key into a single value. Since each group contains all instances of a word, we just need the number of items in each group as the value for each map key. Luckily, one more time, the Standard Library has our back, with the summingInt collector, which sums up an int representation of each item in a group. Here, we could specify a function that would return a different int for each item (for example, if we were counting total letters rather than words, the expression would be s -> s.length()), but we don't want to, so we neglect to make use of the s variable provided to us, and constantly return 1 with s -> 1, ensuring that 1 will be added for every instance of the word.
  • TL;DR on the count method: we use built-in methods to concisely filter out stop words, then group the remaining words into a map with words as keys, and count the number of instances of those words to be used as the values, all in 2 lines.
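
To see the pieces described above working together without needing a file on disk, here is a minimal sketch that runs the same kind of pipeline over an in-memory list of lines. Several things here are my own assumptions and not part of the answer's code: the class name WordCountSketch, the shortened stop-word set, the extra toLowerCase and empty-token steps (the question lowercases its tokens, and split can yield an empty first token when a line starts with a delimiter), and the use of Collectors.counting() with a TreeMap so the output is sorted.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountSketch {
    // Same delimiter characters as the question, written as a regular expression.
    private static final String DELIMITERS = "[ \t,;.?!\\-:@\\[\\](){}_*/]+";

    // Shortened, hypothetical stop-word set; a Set makes contains() a hash lookup.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "and"));

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // One line becomes many words; flatMap merges them into one stream.
                .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
                .map(String::toLowerCase)
                // Drop the empty token split() can produce at the start of a line.
                .filter(w -> !w.isEmpty())
                .filter(w -> !STOP_WORDS.contains(w))
                // Group equal words and count the members of each group.
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("The cat and the hat.", "A cat is a cat!");
        System.out.println(count(lines)); // prints {cat=3, hat=1}
    }
}
```

Collectors.counting() produces Long values, so the map type differs from the Map<String, Integer> above; summingInt(s -> 1) would work here just as well if you prefer Integer counts.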

Upvotes: 0

Felipe Centeno

Reputation: 3719

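I believe you can use a HashMap to do what you want. Something like this:

HashMap<String, Integer> mymap = new HashMap<>();

// stopWordsArray stands in for whatever array of words you want to count.
for (String word : stopWordsArray) {
    if (mymap.containsKey(word)) {
        mymap.put(word, mymap.get(word) + 1); // seen before: increment the count
    } else {
        mymap.put(word, 1); // first occurrence; autoboxing replaces new Integer(1)
    }
}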

Edit: Added corrections in comments

Second Edit: Here is an Oracle tutorial on how to do this:

It is the same idea, but it looks a little more concise. Here is a summary with the relevant code:

Map<String, Integer> m = new HashMap<>();
for (String word : stopWordsArray) {
    Integer freq = m.get(word); // null if the word has not been seen yet
    m.put(word, (freq == null) ? 1 : freq + 1);
}
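
On Java 8 and later, the same get-then-put idea can be written with Map.merge, which stores the given value for a new key and otherwise combines it with the existing one. This is only a sketch of an alternative, not what the tutorial shows; the class name MergeCountSketch and the sample words are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCountSketch {
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : words) {
            // New key: store 1. Existing key: replace the value with old + 1.
            freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Hypothetical input; in the question these would be the tokens of a line.
        String[] words = {"to", "be", "or", "not", "to", "be"};
        // HashMap does not guarantee iteration order, so entries may print in any order.
        count(words).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

Either way the lookup is by key, so there is no need for an index into the map as there is with two parallel lists.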

Upvotes: 3

Bishoy Kamel

Reputation: 2355

You can also use Pattern and Matcher to count the occurrences of one particular word:

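// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String in = "our goal is our power";
int i = 0; // number of matches found so far
Pattern p = Pattern.compile("our");
Matcher m = p.matcher(in);
while (m.find()) { // each successful find() is one more occurrence
    i++;
}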
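
One caveat: a bare pattern like "our" also matches inside longer words such as "your" or "hour". A minimal sketch of a whole-word variant follows, assuming you want case-insensitive matches; the class and method names are hypothetical, while \b, Pattern.quote and Pattern.CASE_INSENSITIVE are standard java.util.regex features.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCountSketch {
    static int countWholeWord(String text, String word) {
        // \b anchors the match at word boundaries; quote() escapes any
        // regex metacharacters that the search word might contain.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String in = "Our goal is our power, not your hour";
        System.out.println(countWholeWord(in, "our")); // prints 2
    }
}
```

This counts one word per pass over the text, so for counting every word in a file a Map-based approach is likely the better fit.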

Upvotes: 0
