Jin
Jin

Reputation: 47

Most efficient data structure for storing an alphabetically ordered word list

My program will read in a paragraph of words (stored in a text file). It will then need to do the following:

Limitations: I cannot use the Collections class and I cannot store data multiple times. (e.g. Reading in words from the paragraph and storing them into a Set and an ArrayList)

Coding this won't be hard, but I can't figure out what would be the most efficient implementation since the data size could be a few paragraphs from a Wikipedia article or something. Here's my idea for now:

This is somewhat incomplete, but it is what I'm thinking for now. The whole 'Word' class may probably be completely unnecessary, too.

Upvotes: 1

Views: 4879

Answers (4)

Mick Mnemonic
Mick Mnemonic

Reputation: 7956

First, you could create a class that holds the data for the occurrences and the row numbers (along with the word). This class could implement the Comparable interface, providing easy comparisons based on the word frequencies:

public class WordOccurrence implements Comparable<WordOccurrence> {

    private final String word;
    private int totalCount = 0;
    private Set<Integer> lineNumbers = new TreeSet<>();

    public WordOccurrence(String word, int firstLineNumber) {
        this.word = word;
        addOccurrence(firstLineNumber);
    }

    public final void addOccurrence(int lineNumber) {
        totalCount++;
        lineNumbers.add(lineNumber);
    }

    @Override
    public int compareTo(WordOccurrence o) {
        return totalCount - o.totalCount;
    }

    @Override
    public String toString() {
        StringBuilder lineNumberInfo = new StringBuilder("[");
        for (int line : lineNumbers) {
            if (lineNumberInfo.length() > 1) {
                lineNumberInfo.append(", ");
            }
            lineNumberInfo.append(line);
        }
        lineNumberInfo.append("]");
        return word + ", occurences: " + totalCount + ", on rows "
                + lineNumberInfo.toString();
    }
}

When reading the words from the file, it's useful to return the data in a Map<String, WordOccurrence>, mapping words into WordOccurrences. Using a TreeMap, you'll get alphabetical ordering "for free". Also, you may want to remove punctuation from the lines (e.g. using a regexp like \\p{P}) and ignore the case of the words:

public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
        throws IOException {
    TreeMap<String, WordOccurrence> words = new TreeMap<>();

    File file = new File(filePath);
    BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(file)));
    String line = null;
    int lineNumber = 0;

    while ((line = reader.readLine()) != null) {
        // remove punctuation and normalize to lower-case
        line = line.replaceAll("\\p{P}", "").toLowerCase();
        lineNumber++;
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {

            if (words.containsKey(token)) {
                words.get(token).addOccurrence(lineNumber);
            } else {
                words.put(token, new WordOccurrence(token, lineNumber));
            }
        }
    }

    return words;
}

Displaying the occurrences in alphabetical order using the above code is as simple as

for (Map.Entry<String, WordOccurrence> entry :
         countOccurrences("path/to/file").entrySet()) {
        System.out.println(entry.getValue());
}

If you cannot use Collections.sort() (and a Comparator<WordOccurrence>) for sorting by occurrences, you need to write the sorting yourself. Something like this should do it:

public static void displayInOrderOfOccurrence(
        Map<String, WordOccurrence> words) {

    List<WordOccurrence> orderedByOccurrence = new ArrayList<>();

    // sort
    for (Map.Entry<String, WordOccurrence> entry : words.entrySet()) {
        WordOccurrence wo = entry.getValue();

        // initialize the list on the first round
        if (orderedByOccurrence.isEmpty()) {
            orderedByOccurrence.add(wo);
        } else {

            for (int i = 0; i < orderedByOccurrence.size(); i++) {
                if (wo.compareTo(orderedByOccurrence.get(i)) > 0) {
                    orderedByOccurrence.add(i, wo);
                    break;
                } else if (i == orderedByOccurrence.size() - 1) {
                    orderedByOccurrence.add(wo);
                    break;
                }
            }
        }
    }

    // display
    for (WordOccurrence wo : orderedByOccurence) {
        System.out.println(wo);
    }
}

Running the above code using the following test data:

Potato; orange.
Banana; apple, apple; potato.
Potato.

will produce this output:

apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]
potato, occurrences: 3, on rows [1, 2, 3]

potato, occurrences: 3, on rows [1, 2, 3]
apple, occurrences: 2, on rows [2]
banana, occurrences: 1, on rows [2]
orange, occurrences: 1, on rows [1]

Upvotes: 2

jeorfevre
jeorfevre

Reputation: 2316

you can have a structure like this one : https://gist.github.com/jeorfevre/946ede55ad93cc811cf8

/**
* 
* @author Jean-Emmanuel [email protected]
*
*/
public class WordsIndex{
        HashMap<String, Word> words = new HashMap<String, Word>();

        public static void put(String word, int line, int paragraph){
            word=word.toLowerCase();

            if(words.containsKey(word)){
                Word w=words.get(word);
                w.count++;

            }else{
                //new word
                Word w = new Word();
                w.count=1;
                w.line=line;
                w.paragraph=paragraph;
                w.word=word;
                words.put(word, w);
            }



        }
    }

    public class Word{
        String word;
        int count;
        int line;
        int paragraph;
    }

enjoy

Upvotes: 0

Razvan Manolescu
Razvan Manolescu

Reputation: 434

You can use a simple TreeMap<String, Integer> for frequency lookups.

Lookups should be O(1), given that the words are short(i.e what you would find a normal text). If you expect lots of unsuccessful lookups (lots of searches for words that don't exist), you could prefilter using a Bloom Filter.

I'd start with a straightforward implementation, and optimize further if needed (parse the stream directly, instead of splitting each line with a separator and reiterating).

Upvotes: 2

Alaa Abuzaghleh
Alaa Abuzaghleh

Reputation: 1009

you can use TreeMap it is very suitable for getting the data ordered. use your word as key and the frequency as value. for example let the following is you paragraph

Java is good language Java is object oriented so I will do the following in order to store each word and its frequency

String s = "Java is good language Java is object oriented"   ; 
String strArr [] = s.split(" ") ; 
TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
for(String str : strArr){
   if(tm.get(str) == null){
         tm.put(str, 1) ; 
   }else{
        int count = tm.get(str) ; 
        count+=1 ; 

   }
}

hopefully this will help you

Upvotes: 0

Related Questions