Reputation: 492
I have a very huge text document. I am implementing "Search" functionality to find occurrences of a given string in the file and to display its position. It is not just whole word search, it can have part of a word / sentance/ paragraph. I am working out on efficient data structure for this process. If it is whole word search I could have used tries/ hash table. I will not be able to use suffix array/ suffix tree as the file size is very large. Sorting is also not that efficient. Other simple option is just to use string search/ regular expression functionality of the framework, which takes linear time. Is there any better known approach for this kind of opeation? Initially it is just string search, later on planning to give search with metacharacters.
Upvotes: 2
Views: 224
Reputation: 1100
Trie and suffix tree/array are a good option but if you do not like them i have another solution: create a hash table:
If you want to find a string you have 2 options:
If the size of the string is lower than N you just search it into the hash table ~O(1) for the search and o(size_of_string) for creating the hash_key
If the size is larger than N you just create chunks of size N and do this: Search a chunk and remember all the position. Than you do the same for the next chunk and check if there are numbers that are consecutive ( ex: first time we have i, j and second time we have k, i+N , than i, i+N is a good combination) save the last number of a consecutive pair(i, i+N, you keep just i+N) and continue until you don't have a number in your Stack or you finished the word
Hope it helped.
Upvotes: 1
Reputation: 1339
Lucene.NET is a search engine library that does text scanning with indexes: http://incubator.apache.org/lucene.net/
Upvotes: 0