Reputation: 6051

Multithreading approach to find text pattern in files

Consider simple Java application which should traverse files tree in a disc to find specific pattern in the body of the file.

Wondering is it possible to achieve better performance, using multi-threading, for example when we find new folder we submit new Runnable in fixed ThreadPool. Runnable task should traverse folder to find out new folders etc. In my opinion this operation should be IO bound, not CPU bound, so spawning new Thread would not improve performance.

Does it depends of hard drive type ? ( hdd, ... etc) Does it depends on OS type ?

IMHO the only thing that can be parallel is - to spawn new Thread for parsing file content to find out pattern in file body.

What is the common pattern to solve this problem it ? Should it be multi-threaded or single-threaded ?

Upvotes: 0

Answers (4)

OldCurmudgeon

Reputation: 65811

I did some experiments on just this question some time ago. In the end I concluded that I could achieve a far better improvement by changing the way I accessed the file.

Here's the file walker I eventually ended up using:

// 4k buffer size ... near-optimal for Windows.
static final int SIZE = 4 * 1024;

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter h, FileInputStream f) throws FileNotFoundException, IOException {
  // Use a mapped and buffered stream for best speed.
  // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
  FileChannel ch = f.getChannel();
  // How much I've read.
  long red = 0L;
  do {
    // How much to read this time around. 
    long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
    // Map a byte buffer to the file.
    MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
    // How much to get.
    int nGet;
    // Walk the buffer to the end or until the hunter has finished.
    while (mb.hasRemaining() && h.ok()) {
      // Get a max of 4k.
      nGet = Math.min(mb.remaining(), SIZE);
      // Get that much.
      mb.get(buffer, 0, nGet);
      // Offer each byte to the hunter.
      for (int i = 0; i < nGet && h.ok(); i++) {
        h.check(buffer[i]);
      }
    }
    // Keep track of how far we've got.
    red += read;
    // Stop at the end of the file.
  } while (red < ch.size() && h.ok());
  // Finish off.
  h.close();
  ch.close();
  f.close();
}

Upvotes: 2

4ndrew

Reputation: 16212

I've performed some research in this area while working under test project, you can look at the project on github at: http://github.com/4ndrew/filesearcher. Of course the main problem is a disk I/O speed, but if you would use optimal count of threads to perform reading/searching in parallel you can get better results in common.

UPD: Also look at this article http://drdobbs.com/parallel/220300055

Upvotes: 2

Tudor

Reputation: 62439

What you could do is this: implement a single-producer multi-consumer pattern, where one thread searches the disk, retrieves files and then the consumer threads process them.

You are right that in this case using multiple threads to scan the disk would not be beneficial, in fact it would probably degrade performance, since the disk needs to seek the next reading position every time, so you end up bouncing the disk between the threads.

Upvotes: 1

maximdim

Reputation: 8169

You stated it right that you need to determine if your task is CPU or IO bound and then decide could it benefit from multithreading or not. Generally disk operations are pretty slow so unless amount of data you need to parse and parsing complexity you might not benefit from multithreading much. I would just write a simple test - just to read files w/o parsing in single thread, measure it and then add parsing and see if it's much slower and then decide.

Perhaps good design would be to use two threads - one reader thread that reads the files and places data in (bounded) queue and then another thread (or better use ExecutorService) parses the data - it would give you nice separation of concerns and you could always tweak number of threads doing parsing. I'm not sure if it makes much sense to read disk with multiple threads (unless you need to read from multiple physical disks etc).

Upvotes: 1

Multithreading approach to find text pattern in files

Answers (4)

Related Questions