Reputation: 129
I am designing a program where the user enters a word and the program determines how many times the word has appeared in the text corpus. Now the text corpus is too large to fit in memory and to optimize the memory I have decided to use an on disk data structure. I wanted to use a hash table for insertion and search but don't know how to design it for the disk. How can I design the key such that it takes constant time for looking up the value for the key on disk. Should I create separate files for a particular subset of keys such that look up is O(1)? I do know that B Trees exist but how can you design such a hash table for such an application? Thanks in advance for your answers!
Upvotes: 0
Views: 653
Reputation: 3818
As pjs said in a comment above, the actual memory footprint required to store the counts of one billion tokens will likely be surprisingly small: natural language (and a lot of other things) follows Zipf's law, which basically says that your most common word will be far more common than the second-most common, which in turn is far more common than the third-most common, and so on. Therefore, a huge proportion of those one billion tokens will be "a" and "the", assuming you're doing this for English.
In other words, just try using an unordered_map<string, uint_least32_t> first and see how that works.
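For instance, a minimal counting loop along those lines might look like the sketch below (the whitespace tokenization and lowercasing here are just placeholders for whatever tokenization your corpus actually needs):
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using namespace std;

int main()
{
    unordered_map<string, uint_least32_t> counts;
    string token;

    // Tally whitespace-delimited tokens from standard input.
    while (cin >> token)
    {
        // Crude normalization: lowercase the token.
        transform(token.begin(), token.end(), token.begin(),
                  [](unsigned char c) { return tolower(c); });
        ++counts[token];
    }

    // Answer a single query, as in the question.
    string query = "the";
    auto it = counts.find(query);
    cout << query << ": " << (it != counts.end() ? it->second : 0) << endl;

    return 0;
}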
Since you mentioned that the solution can occupy at most 2 MB of memory, I decided to see if an unordered_map<string, uint_least32_t> could hold all the types and their counts required. First, I used Python's NLTK to get the number of unique words in the Brown corpus:
from nltk.corpus import brown

# Count the distinct (lowercased) word types in the Brown corpus.
token_types = set(word.lower() for word in brown.words())
print(len(token_types))
This gave me a result of 49,815 unique words. I then created an unordered_map<string, uint_least32_t> with 49,815 keys and estimated its size by modifying a solution from a related question:
#include <cstdint>
#include <cstdlib>   // for rand() and EXIT_SUCCESS
#include <iostream>
#include <string>
#include <unordered_map>

using namespace std;

// Using uint_least32_t for token counts because uint_least16_t might be
// a bit too narrow for counting frequencies.
typedef unordered_map<string, uint_least32_t> TokenFrequencyMap;

// Rough estimate of the map's footprint: per-entry payload plus one node
// pointer, one pointer per bucket, and a little fixed overhead.
static size_t estimateMemoryUsage(const TokenFrequencyMap& map)
{
    size_t entrySize = sizeof(TokenFrequencyMap::key_type) + sizeof(TokenFrequencyMap::mapped_type) + sizeof(void*);
    size_t bucketSize = sizeof(void*);
    size_t adminSize = 3 * sizeof(void*) + sizeof(TokenFrequencyMap::size_type);
    return adminSize + map.size() * entrySize + map.bucket_count() * bucketSize;
}

int main()
{
    constexpr TokenFrequencyMap::size_type vocabSize = 49815;

    // Fill the map with vocabSize dummy tokens and random counts.
    TokenFrequencyMap counts;
    counts.reserve(vocabSize);
    for (TokenFrequencyMap::size_type i = 0; i < vocabSize; ++i)
    {
        string token = to_string(rand());
        uint_least32_t count = rand();
        counts[token] = count;
    }

    size_t memoryUsage = estimateMemoryUsage(counts);
    cout << memoryUsage << endl;

    return EXIT_SUCCESS;
}
On my system (x86_64-linux-gnu with GCC 4.8.4 and the flags -fexceptions -march=corei7 -O2 -std=c++11), this outputs 1421940 bytes, which is roughly 1.36 MB. Therefore, assuming that the distribution of your text is similar to that of the Brown corpus, you should have no problems using an in-memory solution implemented with an unordered_map<string, uint_least32_t>.
Upvotes: 3
Reputation: 134005
Whether this can be done within your 2 MB memory requirement depends on the number of distinct words in your corpus. If you use the Brown corpus mentioned in the previous answer, you have:
49,815 words at an average length of 8.075 characters = 402,256 bytes
49,815 counts at 4 bytes per count = 199,260 bytes
If you were to pack that all in a character array so that you could sequentially search it, you would need to add another 49,815 nul terminators. The structure would be:
word,\0,count,word,\0,count . . .
That would require a total of 651,331 bytes. So you know at least that your raw data will fit in memory.
You could get creative with that and add a sorted index with an additional 49,815 pointers into that array. That would cost you another 199,260 bytes and give you O(log2(n)) lookup. Considering the small number of keys, that would be pretty darned quick lookup. Not constant, but very good, and it fits in less than a megabyte.
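A sketch of that layout, using 32-bit offsets into the packed array in place of raw pointers (which matches the 4-bytes-per-entry figure above); the PackedCorpus name and its add/finish/lookup helpers are purely illustrative:
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct PackedCorpus
{
    std::vector<char> data;            // word,\0,count,word,\0,count . . .
    std::vector<std::uint32_t> index;  // offsets into data, sorted by word

    void add(const std::string& word, std::uint32_t count)
    {
        index.push_back(static_cast<std::uint32_t>(data.size()));
        data.insert(data.end(), word.begin(), word.end());
        data.push_back('\0');
        const char* raw = reinterpret_cast<const char*>(&count);
        data.insert(data.end(), raw, raw + sizeof(count));
    }

    // Sort the offsets by the words they point at (do this once, after loading).
    void finish()
    {
        std::sort(index.begin(), index.end(),
                  [this](std::uint32_t a, std::uint32_t b)
                  { return std::strcmp(&data[a], &data[b]) < 0; });
    }

    // O(log2(n)) lookup via binary search on the sorted offsets.
    std::uint32_t lookup(const std::string& word) const
    {
        auto it = std::lower_bound(index.begin(), index.end(), word,
                                   [this](std::uint32_t off, const std::string& w)
                                   { return std::strcmp(&data[off], w.c_str()) < 0; });
        if (it == index.end() || std::strcmp(&data[*it], word.c_str()) != 0)
            return 0;  // word not in the corpus
        std::uint32_t count;
        std::memcpy(&count, &data[*it + word.size() + 1], sizeof(count));
        return count;
    }
};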
If you want constant lookup time, you could generate a minimal perfect hash for the keys. You then replace the sorted index I mention above with an array of pointers. There's no need to store a key. The minimal perfect hash generates a number from 0 to n-1; call it k. You go to the kth index in the array to retrieve a pointer, p, into the flat array.
Generating the hash function shouldn't take too long. In this article, the author claims that he created a minimal perfect hash function for 100,000 words in about 2.5 seconds. You can build that during preprocessing, or you could have the program compute it at startup.
That all should fit in under a megabyte of space, and it should perform faster than a standard map because it guarantees that there are no collisions, so no bucket contains more than one value. Memory allocation overhead, too, is minimized, because there are only two allocations: one for the raw data array, and one for the index array.
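To sketch just the lookup side (the generated hash itself is assumed here: mph() stands in for whatever a generator such as gperf or CMPH would emit for your key set, and g_data/g_offsets are the two arrays just described):
#include <cstdint>
#include <cstring>
#include <string>

// Assumed to be generated offline (e.g. by gperf or CMPH): maps each of the
// n known words to a unique slot in [0, n). The name is just a placeholder.
std::uint32_t mph(const std::string& word);

// The two allocations mentioned above: the flat word,\0,count data array and
// one offset per slot pointing at the start of that word's record.
extern const char* g_data;
extern const std::uint32_t* g_offsets;

constexpr std::uint32_t kVocabSize = 49815;  // number of keys the hash was built for

std::uint32_t lookup(const std::string& word)
{
    std::uint32_t slot = mph(word);
    if (slot >= kVocabSize)
        return 0;  // an unknown word may hash outside the valid range

    const char* record = g_data + g_offsets[slot];
    // The hash is only guaranteed collision-free for words in the key set,
    // so compare the stored key before trusting the slot.
    if (std::strcmp(record, word.c_str()) != 0)
        return 0;  // word not in the corpus
    std::uint32_t count;
    std::memcpy(&count, record + word.size() + 1, sizeof(count));
    return count;
}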
Upvotes: 3
Reputation:
How about using a trie? You would create a file of identical records (a set of integer indexes, one per alphabet letter) viewed as a large array, so that random access is possible. You only ever need to process one node at a time, so there are no worries about RAM space. This is space-hungry, but the implementation is easy.
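For instance, a sketch of such a fixed-size node record and the lookup walk (assuming lowercase a-z words, record 0 as the root, and 0 doubling as "no child", since no node ever points back to the root):
#include <cstdint>
#include <cstdio>
#include <string>

// One fixed-size record per trie node, so the file behaves like a large
// array and any node can be fetched with a single seek + read.
struct TrieNode
{
    std::uint32_t children[26];  // record index of each child ('a'..'z'), 0 = none
    std::uint32_t count;         // occurrences of the word ending at this node
};

// Walk the on-disk trie from the root (record 0), holding only one node
// in memory at a time.
std::uint32_t lookup(std::FILE* f, const std::string& word)
{
    std::uint32_t node = 0;
    TrieNode rec;
    for (char ch : word)
    {
        if (ch < 'a' || ch > 'z')
            return 0;  // this sketch only handles lowercase a-z
        std::fseek(f, static_cast<long>(node * sizeof(TrieNode)), SEEK_SET);
        if (std::fread(&rec, sizeof(rec), 1, f) != 1)
            return 0;
        node = rec.children[ch - 'a'];
        if (node == 0)
            return 0;  // no such child, so the word never occurred
    }
    std::fseek(f, static_cast<long>(node * sizeof(TrieNode)), SEEK_SET);
    if (std::fread(&rec, sizeof(rec), 1, f) != 1)
        return 0;
    return rec.count;
}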
Upvotes: 1