Reputation: 21
I have to search for a string in 10 large files (each about 70 MB, in zip format) and print the lines containing the search string to 10 corresponding output files (i.e. file1's matches go to output_file1, file2's to output_file2, and so on). The same program takes 15 minutes for a single file, so I expected that using 10 threads to read the 10 files and write to 10 different output files would also finish in about 15 minutes, but it takes 40 minutes.
How can I solve this? Or is this simply how long multithreading takes?
Upvotes: 2
Views: 316
Reputation: 2480
You might want to look at the "Wide Finder" project Tim Bray created. It sounds very much like what you are doing, and I think it examines most, if not all, of the issues you'll run into. HTH
Upvotes: 0
Reputation: 147154
I am going to guess this is a GC issue. I guess you are reading the files a line at a time into a String. Perhaps you are even recompiling a regex for every line. Either way, that is a lot of memory allocation, all of it short-lived objects. Multiple threads might tip enough of this into getting copied into the "survivor" spaces (in the typical Sun GC implementation). I would use VisualVM, or an obscure command-line argument, to monitor how hard the GC is working.
Could also be a lock contention issue, but this looks embarrassingly parallel.
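As a hedged illustration of the regex point (this is not the original poster's code; the class, method, and file names are placeholders): compiling the pattern once, outside the per-line loop, avoids allocating a fresh Pattern object for every line. GC activity itself can be watched in VisualVM or with the -verbose:gc flag.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Pattern;

public class GrepFile {
    // Hypothetical helper: scans one (already decompressed) text file.
    public static void grep(String inFile, String outFile, String search) throws IOException {
        // Compile the pattern once; compiling it (or calling String.matches)
        // inside the loop would allocate a new Pattern for every line.
        Pattern pattern = Pattern.compile(Pattern.quote(search));
        try (BufferedReader reader = new BufferedReader(new FileReader(inFile));
             PrintWriter writer = new PrintWriter(outFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    writer.println(line);
                }
            }
        }
    }
}
```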
Upvotes: 0
Reputation: 53830
You probably have hard disk contention, which multi-threading won't help. In your case, you probably want just enough threads to keep the disk drive at 100% usage.
I'm assuming that the hard disk is your bottleneck, not the CPU. Multi-threading only gets things done "faster" if each thread doesn't have to fight over the same hardware. So, with multiple cores (CPUs) and multiple hard drives, you'll see better performance with multi-threading.
I'm surprised that it takes 15 minutes for a single file.
Here's how I would design this. 70 MB is not large. You could probably load each 70 MB compressed file into memory, one per thread, then unzip the data on the fly as you search the in-memory compressed stream, keeping only a small amount of uncompressed data around at any time. (Once you've searched it, throw it away.) This avoids hard-disk thrashing and lets your CPU reach 100% usage.
If memory is an issue, then load a few MB at a time from disk.
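A minimal sketch of that idea, assuming standard ZIP archives containing text entries (the class and method names here are made up for illustration): the whole compressed file is read into memory in one sequential pass, then decompressed and searched from the in-memory copy, so the disk is touched only once per file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemorySearch {
    public static void search(String zipFile, String outFile, String needle) throws IOException {
        // One sequential read pulls the whole ~70 MB compressed file into memory.
        byte[] compressed = Files.readAllBytes(Paths.get(zipFile));

        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(compressed));
             PrintWriter out = new PrintWriter(outFile)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // Decompress and scan line by line; only a small window of
                // uncompressed data is alive at any moment.
                BufferedReader reader = new BufferedReader(new InputStreamReader(zin));
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.contains(needle)) {
                        out.println(line);
                    }
                }
            }
        }
    }
}
```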
Upvotes: 0
Reputation: 40298
When you run this, is your CPU already at 100%? If not, it is one of two things;
Upvotes: 0
Reputation:
More threads will most likely just make it run slower, since your bottleneck is going to be disk I/O. If you could load all your data into memory first, then you would see some speed-up from multiple threads, but only up to #cores + 1; anything more would just be context-switching overhead.
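As a rough sketch of that thread-count point (the class name is illustrative): once the data is already in memory, the work is CPU-bound, so a fixed pool sized to the core count plus one is enough.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SearchPool {
    public static ExecutorService create() {
        // CPU-bound work: more threads than cores + 1 only adds
        // context-switching overhead, not throughput.
        int threads = Runtime.getRuntime().availableProcessors() + 1;
        return Executors.newFixedThreadPool(threads);
    }
}
```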
Upvotes: 0
Reputation: 57707
Accessing files concurrently typically gets slower beyond 2-3 threads, since the hard disk ends up thrashing about trying to read from all the files simultaneously, similar to reading a heavily fragmented file.
To avoid this, split the work into file readers and file parsers. The file readers bring in the data from the file (also decompressing it), and the file parsers parse the data. You can use a PipedInputStream/PipedOutputStream pair to forward data from your file readers to your file parsers.
Because your files are zipped, reading involves both I/O and CPU, which can be interleaved nicely across 2-4 threads reading all the files. For parsing, it's easiest to have just one thread reading from each PipedInputStream, so you will have one parser thread per file. Using multiple threads per file would require splitting up the stream and handling searches that span block boundaries, which complicates things and is not necessary here, since you probably have sufficient parallelism with 10 parser threads and 2-4 reader threads.
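A rough sketch of that reader/parser split for a single file, with illustrative names and assuming plain ZIP input (in the full design the reader work would be submitted to a shared pool of 2-4 threads, and one such pipeline would exist per file):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintWriter;
import java.util.zip.ZipInputStream;

public class ReaderParserPipeline {
    public static void process(String zipFile, String outFile, String needle) throws IOException {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 64 * 1024);

        // Reader side: decompress the zip and push the raw text into the pipe.
        Thread reader = new Thread(() -> {
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipFile));
                 PipedOutputStream out = pipeOut) {
                byte[] buf = new byte[8192];
                while (zin.getNextEntry() != null) {
                    int n;
                    while ((n = zin.read(buf)) > 0) {
                        out.write(buf, 0, n);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // Parser side: one thread per file reads lines off the pipe and writes matches.
        try (BufferedReader lines = new BufferedReader(new InputStreamReader(pipeIn));
             PrintWriter writer = new PrintWriter(outFile)) {
            String line;
            while ((line = lines.readLine()) != null) {
                if (line.contains(needle)) {
                    writer.println(line);
                }
            }
        }
    }
}
```

Closing the PipedOutputStream when the reader finishes is what lets the parser's readLine() eventually return null and end cleanly.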
Upvotes: 5
Reputation: 2136
I suppose you don't have a 10-core CPU machine in use, so your threads are not really running in parallel; therefore it takes a bit longer than it mathematically should. Next, be aware that the thread management itself also takes a bit of time (though that part is negligible). Maybe you can speed up your search mechanism for the files to gain some speed; for that you would need to post your source code. But some advice:
But be aware that these measures may result in code that is very complex to read for another human being, or for yourself in, let's say, six months or more, because you won't remember everything you did and why you did it (comments ;))
Upvotes: 4