Jean-Pierre Coffe
Jean-Pierre Coffe

Reputation: 95

Best index data structure for querying text files

I have a very large text file (50GB+). I would like to perform searches of this text file, specifically:

I can build an index of this file but it needs to remain somewhat small (of the order of the size of the input text file). What is the fastest/optimal data structure or index to solve this problem?

Upvotes: 1

Views: 91

Answers (1)

mandy8055
mandy8055

Reputation: 6735

A Suffix Array or a Compressed Suffix Array (CSA) would be an optimal choice for your requirement. Suffix Arrays are a data structure that allows for fast pattern matching and can be stored compactly, making them a suitable choice for large text files.

A Suffix Array is a sorted array of all the suffixes of a given text. It can be built in O(n log n) time, where n is the length of the text. To search for an exact match of a query string in the Suffix Array, you can use binary search, which takes O(m log n) time, where m is the length of the query string.

Compressed Suffix Array (CSA) comes to picture if you need a more space-efficient representation of the Suffix Array. The CSA reduces space usage significantly while still allowing for fast search operations.

To build a Compressed Suffix Array, you can:

  1. Construct the Suffix Array for the given text file.
  2. Compress the Suffix Array using techniques like Run-Length Encoding (RLE), Burrows-Wheeler Transform (BWT), or any other compression algorithms. I would go with any of the first two.

Once you have the Compressed Suffix Array, you can perform searches quickly by decompressing only the necessary parts of the array. This way, you can achieve a good balance between space usage and search performance.

REFERENCES AND USEFUL READS:

  1. https://www.geeksforgeeks.org/suffix-array-set-1-introduction/
  2. https://cs.stackexchange.com/a/69677
  3. https://link.springer.com/chapter/10.1007/978-3-642-16321-0_36

Upvotes: 1

Related Questions