user1919035

Reputation: 227

Search in a file with good performance

I'm trying to implement a search across 200,000 text files whose sizes vary from 50 KB to 5 MB, for a total of 1.7 GB. I'm planning to develop a search engine (just a sample one). The process is:

1) Extract words from each file and store them in a separate file (40,000,000 words)
2) Search for each word in each file (40,000,000 (words) x 200,000 (files) = 8 x 10^12 searches)
3) Generate a boolean index (650 MB).

So most of the work involved here is searching within documents or files, and the 2nd step takes most of the time (4+ hours).
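For scale, step 2 boils down to a nested loop of roughly this shape (placeholder names, not my actual code), which is where the 8 x 10^12 figure comes from:

import java.util.List;

// Illustration of the brute-force shape of step 2 (placeholder names).
public class BruteForceSearch {
    // 'words' holds the ~40,000,000 extracted words, 'files' the 200,000 file paths.
    static void searchAll(List<String> words, List<String> files) {
        for (String word : words) {           // 40,000,000 iterations
            for (String file : files) {       // x 200,000 = 8 x 10^12 pairs
                searchInFile(word, file);     // each pair scans a whole file line by line
            }
        }
    }

    static void searchInFile(String word, String file) {
        // stands in for the line-by-line search shown below
    }
}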

This is the program I've written for searching a word in Java:

int count = 0;
String text;
BufferedReader reader = new BufferedReader(new FileReader("fileName.txt"));
while ((text = reader.readLine()) != null) {
    if (text.indexOf(searchString) != -1) {
        if (text.equals(searchString)) {
            System.out.print("Word found in line number " + count);
            break;
        }
    }
    count++;
}
reader.close();

Program in Python:

count = 0
with open(filePath) as f:
    for line in f:
        count += 1
        if searchWord in line:
            print("Word found in line number " + str(count))

The output is correct, but it takes a lot of time. Language is not a deciding criterion for me; I'm looking for better performance. Is there any way I can improve this? Since most of the work is the searching itself, is there a better approach, given that it is searching for many small chunks across large amounts of data?

(My PC config: 8 GB RAM, 4th-generation i7)

Upvotes: 0

Views: 125

Answers (3)

Peter Lawrey

Reputation: 533880

Running on one of the cheapest laptops I could buy, which runs Windows 7:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class SearchTestMain {
    public static void main(String[] args) throws IOException {
        File file = new File("deleteme.txt");
        PrintWriter pw = new PrintWriter(file);
        Random rand = new Random();
        int numbers = 42 * 1000 * 1000;
        long start = System.currentTimeMillis();
        System.out.println("Writing " + file);
        // average line length ~36 bytes.
        for (int i = 0; i < numbers; i++) {
            pw.println(rand.nextLong() & Long.MAX_VALUE); // positive only
            pw.println(rand.nextLong() & Long.MAX_VALUE); // positive only
        }
        pw.close();
        long mid = System.currentTimeMillis();

        System.out.println("Reading " + file);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String searchTerm = "31415926";
        for (String line; ((line = br.readLine())) != null; )
            if (line.contains(searchTerm))
                System.out.println("found " + searchTerm + " in " + line);
        br.close();
        long end = System.currentTimeMillis();
        System.out.printf("Writing took %.1f seconds, reading took %.1f seconds for a %,d MB file%n",
                (mid - start) / 1e3, (end - mid) / 1e3, file.length() / 1000000);
        file.delete();
    }
}

prints

Writing deleteme.txt
Reading deleteme.txt
found 31415926 in 6728531415926595287
found 31415926 in 8919165331415926916
... some deleted ...
found 31415926 in 2826331415926854237
found 31415926 in 5676780473141592623
Writing took 35.5 seconds, reading took 55.1 seconds for a 1,753 MB file

I would be very surprised if reading and searching the text alone takes much more than a minute. If it takes much longer, your program is doing something you are not telling us about.

Upvotes: 1

genonymous

Reputation: 1750

You can try to build an index using a trie data structure first and then implement the search on it.
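A minimal sketch of what such a trie could look like, assuming each word is mapped to the ids of the files that contain it (the word-to-file-id mapping and the class and field names are illustrative, not part of the answer):

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch: a trie mapping each indexed word to the ids of the files containing it.
public class TrieIndex {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        final Set<Integer> fileIds = new HashSet<>(); // files in which a word ending here occurs
        boolean isWord;
    }

    private final Node root = new Node();

    // Record one occurrence of 'word' seen in the file identified by 'fileId'.
    public void insert(String word, int fileId) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
        node.fileIds.add(fileId);
    }

    // Look up a word; returns the ids of the files containing it, or an empty set.
    public Set<Integer> search(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return Collections.emptySet();
            }
        }
        return node.isWord ? node.fileIds : Collections.emptySet();
    }
}

The index is built in a single pass over all files, after which each lookup costs only the length of the word rather than a scan of every file.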

Upvotes: 0

Kakarot

Reputation: 4252

You can split your file into multiple chunks and then process those chunks in parallel using different threads (similar to MapReduce).

Example: split the file into chunks of 100 MB each (say there are 17 chunks).

Now you can pass these chunks to individual threads and then search for the text.

public class SearchText
{

  public void processFile()
  {
    List<Chunk> totalChunks = splitFile();
    // you have to implement the splitFile() function to split the file into chunks

    for (Chunk chunk : totalChunks)
    {
       // Create a new Thread and process the chunk
       new Thread(new ChunkProcessor(chunk)).start();
    }
  }
}

public class ChunkProcessor implements Runnable
{

   private Chunk myChunk;

   public ChunkProcessor(Chunk chunk)
   {
     myChunk = chunk;
   }


   public void run()
   {
      // search for the text in this chunk
   }
}
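A variation on the same idea, as a sketch: using a fixed-size thread pool (ExecutorService) instead of one raw Thread per chunk keeps the number of concurrent threads bounded by the number of CPU cores. Chunk and ChunkProcessor here are the placeholder types from the code above.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelChunkSearch {
    public void processChunks(List<Chunk> chunks) throws InterruptedException {
        // Bound the pool to the number of available cores instead of one thread per chunk.
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (Chunk chunk : chunks) {
            pool.submit(new ChunkProcessor(chunk)); // ChunkProcessor as sketched above
        }
        pool.shutdown();                           // no new tasks after this point
        pool.awaitTermination(1, TimeUnit.HOURS);  // wait for all chunks to finish
    }
}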

Upvotes: 3
