Reputation: 95
I have a very large text file (50GB+). I would like to perform searches of this text file, specifically:
I can build an index of this file, but it needs to remain somewhat small (on the order of the size of the input text file). What is the fastest/optimal data structure or index for this problem?
Upvotes: 1
Views: 91
Reputation: 6735
A Suffix Array or a Compressed Suffix Array (CSA) would be an optimal choice for your requirement. A Suffix Array is a data structure that allows fast pattern matching and can be stored compactly, making it a suitable choice for large text files.
A Suffix Array is a sorted array of all the suffixes of a given text. It can be built in O(n log n) time, where n is the length of the text. To search for an exact match of a query string in the Suffix Array, you can use binary search, which takes O(m log n) time, where m is the length of the query string.
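To make the idea concrete, here is a minimal Python sketch (it assumes Python 3.10+ for bisect's key argument). It uses a naive sort-based construction, which is fine for illustration but far slower than the O(n log n) algorithms you would want for a 50GB file:

```python
import bisect

def build_suffix_array(text: str) -> list[int]:
    # Sort suffix start positions by comparing the suffixes themselves.
    # Naive construction for illustration only; real implementations use
    # O(n log n) or O(n) algorithms such as SA-IS.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    # Binary search for the contiguous range of suffixes that start with `pattern`.
    m = len(pattern)
    lo = bisect.bisect_left(sa, pattern, key=lambda i: text[i:i + m])
    hi = bisect.bisect_right(sa, pattern, key=lambda i: text[i:i + m])
    return sorted(sa[lo:hi])              # starting positions of all matches

if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)         # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(text, sa, "ana"))   # [1, 3]
```

Because the suffixes are sorted, all occurrences of a pattern occupy one contiguous range of the array, which is what makes the binary search work.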
A Compressed Suffix Array (CSA) comes into the picture if you need a more space-efficient representation of the Suffix Array. The CSA reduces space usage significantly while still allowing for fast search operations.
To build a Compressed Suffix Array, you can:
Once you have the Compressed Suffix Array, you can perform searches quickly by decompressing only the necessary parts of the array. This way, you can achieve a good balance between space usage and search performance.
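To illustrate how a compressed full-text index answers queries without touching the full Suffix Array, here is a toy backward-search sketch built on the Burrows-Wheeler transform, the same machinery that underlies the FM-index (a close relative of the CSA). The helper names (build_fm_index, count_matches) are made up for this example, and the occurrence table is stored uncompressed here; a real CSA/FM-index replaces it with compressed, sampled structures such as wavelet trees:

```python
def build_fm_index(text: str):
    """Toy FM-index: BWT plus helper tables (illustrative, not space-optimized)."""
    text += "\0"                                  # unique end-of-text sentinel
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive, for illustration
    bwt = "".join(text[i - 1] for i in sa)        # character preceding each sorted suffix

    # C[c]: number of characters in the text strictly smaller than c.
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i]: occurrences of c in bwt[:i].  A real index compresses this
    # table instead of storing full prefix counts.
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)

def count_matches(pattern: str, C, occ, n) -> int:
    """Backward search: O(m) table lookups, independent of the text length."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo                                # number of occurrences

if __name__ == "__main__":
    C, occ, n = build_fm_index("banana")
    print(count_matches("ana", C, occ, n))        # 2
    print(count_matches("nab", C, occ, n))        # 0
```

The key point is that the search walks the pattern backwards using only the small C and occ tables, which is why a CSA/FM-index can answer queries without keeping the full Suffix Array in memory. For production use on a 50GB file you would reach for an existing library (for example, the SDSL family of succinct data structure libraries) rather than rolling your own.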
REFERENCES AND USEFUL READS:
Upvotes: 1