Reputation: 187
Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would feed it first with the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementation of this data structure store data in memory, and I already know that the huge list of strings won't fit in memory, and that I'll somehow need to store this list on disk.
I could rely on something like SQLite to store the strings in a indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you guys faced this kind of problems before? Do you know any library that might be helpful? (I'm quite language-agnostic, feel free to throw whatever you have)
Upvotes: 3
Views: 186
Reputation: 50826
There are multiple solutions to efficiently find if a string is a part of a huge set of strings.
A first solution is to use a trie to make the set much more compact. Indeed, many strings will likely start by the same header and re-writing it over and over in memory is not space efficient. It may be enough to keep the full set in memory or not. If not, the root part of the trie can be stored in memory referencing leaf-like nodes stored on the disk. This enable the application to quickly find with part of the leaf-like nodes need to be loaded with a relatively small cost. If the number of string is not so huge, most leaf parts of the trie related to a given leaf of the root part can be loaded in one big sequential chunk from the storage device.
Another solution is to use a hash table to quickly find if a given string exist in the set with a low latency (eg. with only 2 fetches). The idea is just to hash a searched string and perform a lookup at a specific item of a big array stored on the storage device. Open-adressing can be used to make the structure more compact at the expense of a possibly higher latency while only 2 fetches are needed with closed-adressing (the first get the location of the item list associated to the given hash and the second get all the actual items).
One simple way to easily implement such data structures so they can work on a storage devices is to make use of mapped memory. Mapped memory enable you to access data on a storage device transparently as if it was in memory (whatever the language used). However, the cost to access data is the one of the storage device and not the one of the memory. Thus, the data structure implementation should be adapted to the use of mapped memory for better performance.
Finally, you can cache data so that some fetches can be much faster. One way to do that is to use Bloom filters. A Bloom filter is a very compact probabilistic hash-based data structure. It can be used to cache data in memory without actually storing any string item. False positive matches are possible, but false negatives are not. Thus, they are good to discard searched strings that are often not in the set without the need to do any (slow) fetch on the storage device. A big Bloom filter can provide a very good accuracy. This data structure need to be mixed with the above ones if deterministic results are required. LRU/LFU caches might also help regarding the distribution of the searched items.
Upvotes: 2