Algorist
Algorist

Reputation: 492

Efficient approach for File search functionality

I have a very huge text document. I am implementing "Search" functionality to find occurrences of a given string in the file and to display its position. It is not just whole word search, it can have part of a word / sentance/ paragraph. I am working out on efficient data structure for this process. If it is whole word search I could have used tries/ hash table. I will not be able to use suffix array/ suffix tree as the file size is very large. Sorting is also not that efficient. Other simple option is just to use string search/ regular expression functionality of the framework, which takes linear time. Is there any better known approach for this kind of opeation? Initially it is just string search, later on planning to give search with metacharacters.

Upvotes: 2

Views: 224

Answers (2)

Mark
Mark

Reputation: 1100

Trie and suffix tree/array are a good option but if you do not like them i have another solution: create a hash table:

  • Create a hash table for all the strings of length 1, 2, 3, .. N where N is whatever number you want complexity O(N * size_of_text)
  • If you want to find a string you have 2 options:

    If the size of the string is lower than N you just search it into the hash table ~O(1) for the search and o(size_of_string) for creating the hash_key
    If the size is larger than N you just create chunks of size N and do this: Search a chunk and remember all the position. Than you do the same for the next chunk and check if there are numbers that are consecutive ( ex: first time we have i, j and second time we have k, i+N , than i, i+N is a good combination) save the last number of a consecutive pair(i, i+N, you keep just i+N) and continue until you don't have a number in your Stack or you finished the word
    Hope it helped.

Upvotes: 1

Jared Shaver
Jared Shaver

Reputation: 1339

Lucene.NET is a search engine library that does text scanning with indexes: http://incubator.apache.org/lucene.net/

Upvotes: 0

Related Questions