Reputation: 47

Most efficient data structure for storing an alphabetically ordered word list

My program will read in a paragraph of words (stored in a text file). It will then need to do the following:

Print out a list of all the words (alphabetical order). For each word, print the frequency count (how many times the word appears in the entire paragraph) and the line numbers in which the word appears on (does not need to be ordered). If a word appears on a line multiple times, the line number does not need to be stored twice (the frequency count of this word will still be updated)
Display a list of words ordered from most frequent to least frequent.
The user will input a specific word. If the word is found, print out its frequency count.

Limitations: I cannot use the Collections class and I cannot store data multiple times. (e.g. Reading in words from the paragraph and storing them into a Set and an ArrayList)

Coding this won't be hard, but I can't figure out what would be the most efficient implementation since the data size could be a few paragraphs from a Wikipedia article or something. Here's my idea for now:

Have a Word class. This Word class will contain methods to return the word's frequency count and the lines in which the word appears on (and other relevant data).
The paragraph will be stored in a text file. The program will read the data line by line. Split the line into an array and read in words one by one.
As words are being read in from the text file, put the words into some sort of structure. If the structure does not contain the word, create a new word object.
If the structure already contains the word, update the frequency counter for that word.
- I will also have a int to record down the line number. These line numbers will be updated accordingly.

This is somewhat incomplete, but it is what I'm thinking for now. The whole 'Word' class may probably be completely unnecessary, too.

Upvotes: 1

Answers (4)

Mick Mnemonic

Reputation: 7956

First, you could create a class that holds the data for the occurrences and the row numbers (along with the word). This class could implement the Comparable interface, providing easy comparisons based on the word frequencies:

public class WordOccurrence implements Comparable<WordOccurrence> {

    private final String word;
    private int totalCount = 0;
    private Set<Integer> lineNumbers = new TreeSet<>();

    public WordOccurrence(String word, int firstLineNumber) {
        this.word = word;
        addOccurrence(firstLineNumber);
    }

    public final void addOccurrence(int lineNumber) {
        totalCount++;
        lineNumbers.add(lineNumber);
    }

    @Override
    public int compareTo(WordOccurrence o) {
        return totalCount - o.totalCount;
    }

    @Override
    public String toString() {
        StringBuilder lineNumberInfo = new StringBuilder("[");
        for (int line : lineNumbers) {
            if (lineNumberInfo.length() > 1) {
                lineNumberInfo.append(", ");
            }
            lineNumberInfo.append(line);
        }
        lineNumberInfo.append("]");
        return word + ", occurences: " + totalCount + ", on rows "
                + lineNumberInfo.toString();
    }
}

When reading the words from the file, it's useful to return the data in a Map<String, WordOccurrence>, mapping words into WordOccurrences. Using a TreeMap, you'll get alphabetical ordering "for free". Also, you may want to remove punctuation from the lines (e.g. using a regexp like \\p{P}) and ignore the case of the words:

public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
        throws IOException {
    TreeMap<String, WordOccurrence> words = new TreeMap<>();

    File file = new File(filePath);
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(file)));
    String line = null;
    int lineNumber = 0;

    while ((line = reader.readLine()) != null) {
        // remove punctuation and normalize to lower-case
        line = line.replaceAll("\\p{P}", "").toLowerCase();
        lineNumber++;
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {

            if (words.containsKey(token)) {
                words.get(token).addOccurrence(lineNumber);
            } else {
                words.put(token, new WordOccurrence(token, lineNumber));
            }
        }
    }

    return words;
}

Displaying the occurrences in alphabetical order using the above code is as simple as

for (Map.Entry<String, WordOccurrence> entry :
         countOccurrences("path/to/file").entrySet()) {
        System.out.println(entry.getValue());
}

If you cannot use Collections.sort() (and a Comparator<WordOccurrence>) for sorting by occurrences, you need to write the sorting yourself. Something like this should do it:

public static void displayInOrderOfOccurrence(
        Map<String, WordOccurrence> words) {

    List<WordOccurrence> orderedByOccurrence = new ArrayList<>();

    // sort
    for (Map.Entry<String, WordOccurrence> entry : words.entrySet()) {
        WordOccurrence wo = entry.getValue();

        // initialize the list on the first round
        if (orderedByOccurrence.isEmpty()) {
            orderedByOccurrence.add(wo);
        } else {

            for (int i = 0; i < orderedByOccurrence.size(); i++) {
                if (wo.compareTo(orderedByOccurrence.get(i)) > 0) {
                    orderedByOccurrence.add(i, wo);
                    break;
                } else if (i == orderedByOccurrence.size() - 1) {
                    orderedByOccurrence.add(wo);
                    break;
                }
            }
        }
    }

    // display
    for (WordOccurrence wo : orderedByOccurence) {
        System.out.println(wo);
    }
}

Running the above code using the following test data:

Potato; orange.
Banana; apple, apple; potato.
Potato.

will produce this output:

apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
potato, occurrences: 3, on rows [1, 2, 3]

potato, occurrences: 3, on rows [1, 2, 3]
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]

Upvotes: 2

jeorfevre

Reputation: 2316

you can have a structure like this one : https://gist.github.com/jeorfevre/946ede55ad93cc811cf8

/**
* 
* @author Jean-Emmanuel [email protected]
*
*/
public class WordsIndex{
        HashMap<String, Word> words = new HashMap<String, Word>();

        public static void put(String word, int line, int paragraph){
            word=word.toLowerCase();

            if(words.containsKey(word)){
                Word w=words.get(word);
                w.count++;

            }else{
                //new word
                Word w = new Word();
                w.count=1;
                w.line=line;
                w.paragraph=paragraph;
                w.word=word;
                words.put(word, w);
            }



        }
    }

    public class Word{
        String word;
        int count;
        int line;
        int paragraph;
    }

enjoy

Upvotes: 0

Razvan Manolescu

Reputation: 434

You can use a simple TreeMap<String, Integer> for frequency lookups.

Lookups should be O(1), given that the words are short(i.e what you would find a normal text). If you expect lots of unsuccessful lookups (lots of searches for words that don't exist), you could prefilter using a Bloom Filter.

I'd start with a straightforward implementation, and optimize further if needed (parse the stream directly, instead of splitting each line with a separator and reiterating).

Upvotes: 2

Alaa Abuzaghleh

Reputation: 1009

you can use TreeMap it is very suitable for getting the data ordered. use your word as key and the frequency as value. for example let the following is you paragraph

Java is good language Java is object oriented so I will do the following in order to store each word and its frequency

String s = "Java is good language Java is object oriented"   ; 
String strArr [] = s.split(" ") ; 
TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
for(String str : strArr){
   if(tm.get(str) == null){
         tm.put(str, 1) ; 
   }else{
        int count = tm.get(str) ; 
        count+=1 ; 

   }
}

hopefully this will help you

Upvotes: 0

Most efficient data structure for storing an alphabetically ordered word list

Answers (4)

Related Questions