Best index data structure for querying text files

Question

I have a very large text file (50GB+). I would like to perform searches of this text file, specifically:

Be able to return all exact occurrences of a query string, and what follows it in sequence.
At each time point, I want to perform a search for a single query string. I don't care about multiple query matching.
I don't care about pattern matching. I'm only interested in exact matches.
I also want to be able to find all occurrences and their locations in the text file very fast.

I can build an index of this file but it needs to remain somewhat small (of the order of the size of the input text file). What is the fastest/optimal data structure or index to solve this problem?

mandy8055 · Accepted Answer

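A Suffix Array, or a Compressed Suffix Array (CSA) if space is tight, fits your requirement well. Both support fast exact-match lookup of every occurrence of a query string. A plain Suffix Array stores one integer per text position, so for a 50GB file it ends up several times larger than the text itself; the compressed variants are what bring the index down to roughly the size of the input.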

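A Suffix Array is an array of the starting positions of all suffixes of a text, sorted in lexicographic order of the suffixes. It can be built in O(n log n) time, where n is the length of the text. Because the suffixes are sorted, all suffixes that begin with the query string sit in one contiguous range of the array. Binary search finds that range in O(m log n) time, where m is the length of the query string, and every entry in the range is the location of one occurrence. What follows each occurrence is read directly from the text at that location.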
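A minimal sketch of that lookup, for illustration only. The function names (`build_suffix_array`, `find_all`, `following`) are made up for this example, and the construction is a naive sort that would not scale to a 50GB file; it only shows how the two binary searches bracket the matching range.

```python
def build_suffix_array(text: bytes) -> list[int]:
    # Naive construction: sort suffix start positions by the suffix itself.
    # Good enough for a demo; real builders use dedicated algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: bytes, sa: list[int], query: bytes) -> list[int]:
    """Return the start positions of every exact occurrence of query."""
    m = len(query)

    # Lower bound: first suffix whose first m bytes are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m bytes are > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    # Everything in sa[start:lo] begins with query.
    return sorted(sa[start:lo])


def following(text: bytes, pos: int, query: bytes, width: int = 10) -> bytes:
    # The bytes that come right after an occurrence starting at pos.
    end = pos + len(query)
    return text[end:end + width]
```

With `text = b"banana bandana"` and `sa = build_suffix_array(text)`, `find_all(text, sa, b"ana")` returns `[1, 3, 11]`, and `following(text, 1, b"ana")` returns `b"na bandana"`.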

A Compressed Suffix Array (CSA) comes into the picture if you need a more space-efficient representation of the Suffix Array. The CSA reduces space usage significantly while still allowing for fast search operations.

To build a Compressed Suffix Array, you can:

Construct the Suffix Array for the given text file.
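Compress it. The Burrows-Wheeler Transform (BWT) is the usual starting point: it is not a compressor by itself, but it can be read directly off the Suffix Array and tends to group identical characters into long runs, which Run-Length Encoding (RLE) or another compression algorithm can then shrink. I would go with BWT and RLE.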
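The two steps above can be sketched on a toy input as follows. This is an illustration under assumptions, not a production pipeline: the helper names are hypothetical, the Suffix Array is built with a naive sort, and the text is assumed to end with a sentinel byte (`$` here) that appears nowhere else in it.

```python
from itertools import groupby


def suffix_array(text: bytes) -> list[int]:
    # Naive construction, for demonstration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text: bytes, sa: list[int]) -> bytes:
    # BWT[k] is the byte that precedes the k-th smallest suffix.
    # For the suffix starting at 0, index -1 wraps around to the sentinel.
    return bytes(text[i - 1] for i in sa)


def run_length_encode(data: bytes) -> list[tuple[int, int]]:
    # Collapse each run of equal bytes into (byte value, run length).
    return [(byte, len(list(run))) for byte, run in groupby(data)]
```

For `text = b"banana$"`, `bwt_from_sa(text, suffix_array(text))` gives `b"annb$aa"`, and run-length encoding it yields five runs instead of seven bytes. On a text this short the saving is negligible; the runs get longer on large, repetitive inputs.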

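Once you have the Compressed Suffix Array, a search touches only the parts of the index it needs: the range of matching suffixes is found on the compressed representation, and only the Suffix Array entries for those matches are recovered in order to report their locations. This way, you can achieve a good balance between space usage and search performance.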
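One well-known BWT-based index of this kind is the FM-index. The sketch below is a toy version written for this answer, not any library's API: the class name `FMIndex`, the `sample_rate` parameter, and the naive `rank` scan are all assumptions made for illustration. It keeps the BWT plus a sparse sample of Suffix Array entries, and recovers the unsampled entries on demand by stepping backwards through the text.

```python
class FMIndex:
    def __init__(self, text: bytes, sample_rate: int = 4):
        # Assumption: text ends with a sentinel byte that occurs nowhere else.
        sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive build
        self.bwt = bytes(text[i - 1] for i in sa)
        # Keep only entries whose text position is a multiple of sample_rate.
        self.samples = {row: pos for row, pos in enumerate(sa)
                        if pos % sample_rate == 0}
        # first[c] = number of bytes in the text strictly smaller than c.
        counts = [0] * 256
        for b in text:
            counts[b] += 1
        self.first, total = [0] * 256, 0
        for c in range(256):
            self.first[c] = total
            total += counts[c]

    def rank(self, c: int, i: int) -> int:
        # Occurrences of byte c in bwt[:i]. A linear scan keeps the demo
        # short; real indexes answer this from precomputed checkpoints.
        return self.bwt.count(c, 0, i)

    def lf(self, row: int) -> int:
        # Move from the suffix at text position p to the one at p - 1.
        c = self.bwt[row]
        return self.first[c] + self.rank(c, row)

    def search_range(self, query: bytes) -> tuple[int, int]:
        # Backward search: narrow the row range one query byte at a time,
        # starting from the last byte of the query.
        lo, hi = 0, len(self.bwt)
        for c in reversed(query):
            lo = self.first[c] + self.rank(c, lo)
            hi = self.first[c] + self.rank(c, hi)
            if lo >= hi:
                return 0, 0
        return lo, hi

    def locate(self, query: bytes) -> list[int]:
        lo, hi = self.search_range(query)
        positions = []
        for row in range(lo, hi):
            steps = 0
            # Walk backwards until a sampled row is reached; position 0 is
            # always sampled, so the walk terminates.
            while row not in self.samples:
                row = self.lf(row)
                steps += 1
            positions.append(self.samples[row] + steps)
        return sorted(positions)
```

`FMIndex(b"banana$").locate(b"ana")` returns `[1, 3]`. A larger `sample_rate` stores fewer Suffix Array entries at the cost of more backward steps per reported match, which is the space/time trade-off mentioned above.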
