Reputation: 227
I'm trying to implement a search across 200,000 text files whose sizes vary from 50 KB to 5 MB, which comes to a total of 1.7 GB. I'm planning to develop a search engine (just a sample one). The process is:
1) Extract the words from each file and store them in a separate file (40,000,000 words)
2) Search for each word in each file (40,000,000 words × 200,000 files = 8 × 10^12 searches)
3) Generate the boolean index (650 MB)
So most of the operations involved here are searches within documents or files, and the 2nd step takes most of the time (4+ hours).
This is the program I've written for searching for a word, in Java:
int count = 0;
String text;
// searchString holds the word to find
BufferedReader reader = new BufferedReader(new FileReader("fileName.txt"));
while ((text = reader.readLine()) != null) {
    if (text.indexOf(searchString) != -1) {
        if (text.equals(searchString)) {
            System.out.print("Word found in line number " + count);
            break;
        }
    }
    count++;
}
reader.close();
Program in Python:
count = 0
with open(filePath) as f:
    for line in f:
        count += 1
        if searchWord in line:
            print("Word found in line number " + str(count))
The output is correct, but it takes a lot of time. Language is not a deciding criterion for me; I'm looking for better performance. Is there any way I can improve this? Since most of the work is searching, is there an efficient approach for searching many small chunks of a large volume of data?
(My PC config: 8 GB RAM, 4th-generation i7)
Upvotes: 0
Views: 125
Reputation: 533880
Running on one of the cheapest laptops I could buy, which runs Windows 7:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class SearchTestMain {
    public static void main(String[] args) throws IOException {
        File file = new File("deleteme.txt");
        PrintWriter pw = new PrintWriter(file);
        Random rand = new Random();
        int numbers = 42 * 1000 * 1000;
        long start = System.currentTimeMillis();
        System.out.println("Writing " + file);
        // two positive longs per iteration, ~21 bytes per line
        for (int i = 0; i < numbers; i++) {
            pw.println(rand.nextLong() & Long.MAX_VALUE); // positive only
            pw.println(rand.nextLong() & Long.MAX_VALUE); // positive only
        }
        pw.close();
        long mid = System.currentTimeMillis();
        System.out.println("Reading " + file);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String searchTerm = "31415926";
        for (String line; (line = br.readLine()) != null; )
            if (line.contains(searchTerm))
                System.out.println("found " + searchTerm + " in " + line);
        br.close();
        long end = System.currentTimeMillis();
        System.out.printf("Writing took %.1f seconds, reading took %.1f seconds for a %,d MB file%n",
                (mid - start) / 1e3, (end - mid) / 1e3, file.length() / 1000000);
        file.delete();
    }
}
prints
Writing deleteme.txt
Reading deleteme.txt
found 31415926 in 6728531415926595287
found 31415926 in 8919165331415926916
... some deleted ...
found 31415926 in 2826331415926854237
found 31415926 in 5676780473141592623
Writing took 35.5 seconds, reading took 55.1 seconds for a 1,753 MB file
I would be very surprised if reading and searching the text alone takes much more than a minute. If it takes much longer, your program is doing something you are not telling us about.
Upvotes: 1
Reputation: 1750
You can try building an index using a trie data structure first and then implementing the search on it.
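A minimal sketch of the idea (class and method names here are my own, assuming Java 8 for Map.computeIfAbsent): each node keeps a map from character to child, insertion walks the word one character at a time, and lookup then costs O(word length) no matter how many words are stored.
import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if some inserted word ends at this node
    }

    private final Node root = new Node();

    // Walk the word character by character, creating nodes as needed.
    public void insert(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.isWord = true;
    }

    // Lookup touches one node per character, regardless of how many
    // words the trie holds.
    public boolean contains(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return false;
            }
        }
        return current.isWord;
    }
}
Each file would be scanned once to populate the trie; after that, none of the 40,000,000 word lookups needs to rescan the files. To get the boolean index, the terminal node could additionally store the set of file IDs the word occurs in (not shown here).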
Upvotes: 0
Reputation: 4252
You can split your file into multiple chunks and then process those chunks in parallel using different threads (similar to MapReduce).
Example: split the file into chunks of 100 MB each (say there are 17 chunks).
Now you can pass these chunks to individual threads and search for the text in each:
public class SearchText
{
    public void processFile()
    {
        List<Chunk> totalChunks = splitFile();
        // you have to implement splitFile() to split the file into chunks
        for (Chunk chunk : totalChunks)
        {
            // create a new thread and process the chunk
            new Thread(new ChunkProcessor(chunk)).start();
        }
    }
}
public class ChunkProcessor implements Runnable
{
    private Chunk myChunk;

    public ChunkProcessor(Chunk chunk)
    {
        myChunk = chunk;
    }

    public void run()
    {
        // search for the text in this chunk
    }
}
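One design note: starting a raw Thread per chunk doesn't bound how many threads run at once. A variant of the same processFile() using a fixed-size pool might look like this (ExecutorService and Executors are standard java.util.concurrent classes; Chunk and ChunkProcessor are still the placeholders from above):
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSearch
{
    public void processFile(List<Chunk> chunks)
    {
        // One worker per core; more threads than cores mostly adds
        // contention once the disk can no longer keep them fed.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (Chunk chunk : chunks)
        {
            pool.submit(new ChunkProcessor(chunk));
        }
        pool.shutdown(); // accept no new tasks; workers drain the queue and exit
    }
}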
Upvotes: 3