sylvester
sylvester

Reputation: 887

How can I use an index file to increase the speed of searching in a big file?

I have a very big file (more than 800 MB), and I have to search a string which has been entered by a user as fast as possible (almost 5 seconds). I have created an index file which stores a number of each line as well as a number of the first byte of a string in that line. I wanted to use the index file to get the number of the first byte of the string and use it to read the related string in the original file by seek().

Now I have this problem:

If the string which has been entered by the user is the last string in the file, I have to read the whole index file to get the number of its first byte. It takes a lot of time. How can I avoid reading the whole index file? Is there anyone who can help me to use this index file to do searching in my original file in a very short time.

Upvotes: 0

Views: 477

Answers (3)

Yuriy Nemtsov
Yuriy Nemtsov

Reputation: 3915

If you want to do this yourself and if your index fits into memory, then you may consider doing the following:

class Index {
    String beginning;
    long location;
}


public class IndexComparator implements Comparator<Index> {
    @Override
    public int compare(Index o1, Index o2) {
        return o1.beginning.compareTo(o2.beginning);
    }
}

public class Main {
    public static void main(String[] args) {
        IndexComparator comparator = new IndexComparator();

        Index[] indexes = {new Index("one", 10), new Index("two", 20), 
                           new Index("three", 30)};
        Arrays.sort(indexes, comparator);

        int i = Arrays.binarySearch(indexes, new Index("two", 0), comparator);

        System.out.println(indexes[i].location); // prints: 20 (correct location)
    }
}

Upvotes: 0

Stephen C
Stephen C

Reputation: 718758

The simple way to implement this is to use an existing text search engine such as Lucene. This should give you search times measured in milliseconds.

If you want to go to the (not inconsiderable) effort of implementing this yourself, then what you need to do is create a reverse index. This is effectively a multi-map whose keys consist of the words of the input text (your big file), and whose values are the locations of each occurrence of each word. Of course, this will be too big to hold in memory, so you have to design and implement efficient disc-based data structures ... and the software to build and query them.

Upvotes: 2

Suraj Chandran
Suraj Chandran

Reputation: 24791

Tries :Extremely fast( O(m), where m is the lenght of indiviual strings)) But not so good space complexity. There is an open implementaiton in googcle-code here.

Ofcourse you could also use a hashmap, but worse case complexity of search is O(n).

Upvotes: 0

Related Questions