Java: Counting occurrences of a word in a huge text file

Question

I have a 115 MB text file containing about 20 million words. I need to use the file as a word collection and count how many times each user-supplied word occurs in it. This is a small part of a larger project. Because the lookup may run repeatedly inside iterations, I need a method that is both fast and correct. Is there an API I can use, or some other approach that would do this quickly? Any recommendations are appreciated.

Stephen C · Accepted Answer

This kind of thing is typically implemented using Lucene, especially if you are going to be restarting your application repeatedly or you don't have oodles of memory. Lucene supports lots of other goodies too.

However, if you want to "roll your own" code, and you have enough memory (probably 1 GB), your application could:

parse the file into a sequence of words,
filter out stopwords,
build a "reverse index" as a HashMap<String, List<Integer>>, where the String keys are the unique words, and the List values give the offsets of each word's occurrences in the file.
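
The three steps above could be sketched roughly as follows. This is only an illustration, not Stephen C's actual code: the class name ReverseIndexSketch, the tiny stopword list, the definition of a "word" as a run of letters or digits, and the choice to lower-case everything are all assumptions made for the example. Offsets here are character offsets into the text, not byte offsets into the file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReverseIndexSketch {
    // Hypothetical stopword list; a real one would be much longer.
    static final Set<String> STOPWORDS = Set.of("a", "an", "and", "of", "the", "to");

    // Assumption: a "word" is a run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Maps each lower-cased non-stopword to the character offsets where it starts. */
    static Map<String, List<Integer>> buildIndex(CharSequence text) {
        Map<String, List<Integer>> index = new HashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            if (STOPWORDS.contains(word)) {
                continue; // step 2: filter out stopwords
            }
            // step 3: record where this occurrence starts
            index.computeIfAbsent(word, k -> new ArrayList<>()).add(m.start());
        }
        return index;
    }

    /** The occurrence count is the length of the word's offset list. */
    static int countOccurrences(Map<String, List<Integer>> index, String word) {
        List<Integer> offsets = index.get(word.toLowerCase(Locale.ROOT));
        return offsets == null ? 0 : offsets.size();
    }

    public static void main(String[] args) {
        String text = "The cat sat on the mat. The cat slept.";
        Map<String, List<Integer>> index = buildIndex(text);
        System.out.println(countOccurrences(index, "cat")); // prints 2
        System.out.println(index.get("cat"));               // prints [4, 28]
    }
}
```

For the real file, the text could be loaded once with something like Files.readString (Java 11+) and passed to buildIndex. Note that stopwords are not indexed at all in this sketch, so a search for one of them reports zero occurrences.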

It could take a few seconds (or minutes) to process a file that big. But once you've created the in-memory reverse index you can do an occurrence search very quickly. (Maybe sub-microsecond per search.)
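
Since the question only asks for the number of occurrences, a leaner variant is possible. The sketch below is a substitution for the reverse index, not what the answer proposes: it keeps a single count per word instead of a list of offsets, and it reads the input line by line so the whole file never has to sit in memory as one string. The class name WordCountSketch and the word pattern are assumptions, and stopword filtering is left out for brevity.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordCountSketch {
    // Assumption: a "word" is a run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /** Reads the source line by line and counts each lower-cased word. */
    static Map<String, Integer> countWords(Reader source) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = WORD.matcher(line);
                while (m.find()) {
                    // merge adds 1 to the existing count, or stores 1 for a new word
                    counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts =
                countWords(new StringReader("to be or not to be\nthat is the question"));
        System.out.println(counts.getOrDefault("be", 0));     // prints 2
        System.out.println(counts.getOrDefault("lucene", 0)); // prints 0
    }
}
```

For a file on disk, the Reader could come from Files.newBufferedReader with the path to the word collection. Each later query is then a single HashMap lookup via getOrDefault.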
