Slow operations in parallel

Question

I need help running operations in parallel. The goal of the code is to extract a large number of small files from the same tar archive into different folders in a very short time. This is the code:

public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}

When I execute this operation once, it takes ~900 ms. But when I execute the same operation multiple times in parallel, as shown below, it takes 20000 ms:

ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}

or

File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}

Joop Eggen · Accepted Answer

One suspicion is that the directories contain many files, so File.mkdirs performs needlessly many checks.
The constructor of BufferedInputStream accepts a custom buffer size. That never helped me much, but it might with your disk. With parallelism it could also help to reduce "disk head movements."
You have probably already tried Files.copy, but still: it might have better memory behavior than readAllBytes.

So the version becomes (eschewing File in favor of Path):

public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = Files.newInputStream(archivePath);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
             new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                // Only touch the file system when the parent directory changes.
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}

Declaring throws RuntimeException, and catching the IOException/ArchiveException without rethrowing it (for example as new IllegalStateException(e)), is a matter of taste.
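To illustrate the rethrowing variant: a minimal sketch, not taken from the question, of how a wrapped exception reaches the code that submitted the task. The class FailLoudly and its methods are hypothetical names, and a plain file read stands in for the tar extraction.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FailLoudly {

    /** Stand-in for decompress(): wraps the checked exception instead of printing it. */
    static byte[] readOrFail(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + file, e);
        }
    }

    /** Runs the task on a pool and returns the failure the submitting code sees, or null. */
    static Throwable failureSeenByCaller(Path file) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<byte[]> result = pool.submit(() -> readOrFail(file));
            result.get(); // rethrows the task's exception wrapped in ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the IllegalStateException thrown inside the task
        } finally {
            pool.shutdown();
        }
    }
}
```

With printStackTrace inside the task, the submitting code cannot tell that an extraction failed; with the rethrow, Future.get reports it.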

Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk. With small files that might just work.

A better approach seems to be reading the next file in one thread and writing it in another. Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so that the writing thread does not have to use is.
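The reader/writer split could be sketched roughly as follows. This is an illustration under assumptions, not a measured solution: the class ReadWritePipeline, the queue capacity of 64 and the thread name are hypothetical, and a plain Iterable of name/bytes pairs stands in for the loop over getNextTarEntry() with readAllBytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

final class ReadWritePipeline {

    /** One unit of work: a target path plus the bytes already read from the archive. */
    private static final class Item {
        final Path target;
        final byte[] data;
        Item(Path target, byte[] data) { this.target = target; this.data = data; }
    }

    /** Sentinel that tells the writer thread to stop; compared by identity. */
    private static final Item END = new Item(null, null);

    /** Writes all entries below root on one dedicated writer thread; returns the file count. */
    static int writeAll(Path root, Iterable<Map.Entry<String, byte[]>> entries)
            throws IOException, InterruptedException {
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(64); // bounds memory use
        AtomicReference<IOException> failure = new AtomicReference<>();
        int[] written = {0};

        Thread writer = new Thread(() -> {
            try {
                for (Item item = queue.take(); item != END; item = queue.take()) {
                    if (failure.get() != null) {
                        continue; // keep draining so the reader never blocks on a full queue
                    }
                    try {
                        Files.createDirectories(item.target.getParent());
                        Files.write(item.target, item.data);
                        written[0]++;
                    } catch (IOException e) {
                        failure.set(e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "tar-writer");
        writer.start();

        // Reader side: in the real code this would be the getNextTarEntry() loop.
        for (Map.Entry<String, byte[]> entry : entries) {
            queue.put(new Item(root.resolve(entry.getKey()), entry.getValue()));
        }
        queue.put(END);
        writer.join(); // join() also makes the writer's updates visible here

        if (failure.get() != null) {
            throw failure.get();
        }
        return written[0];
    }
}
```

The bounded queue keeps the reader from running far ahead of the disk, so memory use stays limited even for a large archive. Whether this beats the single-threaded loop depends on the disk and would need measuring.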

The tar entry probably stores the file size too; that would allow checking whether readAllBytes is still efficient enough for large files.
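A size check of that kind might look like the sketch below. The class SizeAwareCopy and the 1 MiB limit are hypothetical; the size argument is assumed to come from TarArchiveEntry.getSize(), and the stream is assumed to be the TarArchiveInputStream positioned at that entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SizeAwareCopy {

    /** Hypothetical threshold: entries up to this size are buffered fully in memory. */
    static final long IN_MEMORY_LIMIT = 1024 * 1024;

    /**
     * Extracts the current entry from in to target.
     * Returns true if the entry was buffered in memory, false if it was streamed.
     * The stream is not closed, so the caller can move on to the next entry.
     */
    static boolean extract(InputStream in, long declaredSize, Path target) throws IOException {
        if (declaredSize >= 0 && declaredSize <= IN_MEMORY_LIMIT) {
            // Small entry: one read, one write.
            Files.write(target, in.readAllBytes());
            return true;
        }
        // Large or unknown size: stream in chunks to avoid a huge byte array.
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return false;
    }
}
```

In the extraction loop the call would be something like extract(is, entry.getSize(), file).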

Logging was mentioned in this question. It is known that logging can consume much time, and with parallelism that becomes even more critical. But you seem to be aware of it: you wrote that you have written your own logger. For a library, System.Logger is actually best. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies of the past year.
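For reference, a minimal sketch of System.Logger (available since Java 9) in an extraction routine. The class ExtractionLog and the logger name "archive.extract" are hypothetical; which backend actually receives the messages depends on what the application provides.

```java
final class ExtractionLog {

    /** The logger name is a hypothetical example; the backend is chosen by the application. */
    static final System.Logger LOG = System.getLogger("archive.extract");

    /** Logs one extracted entry; the message is formatted only if DEBUG is enabled. */
    static void entryExtracted(String entryName, long millis) {
        LOG.log(System.Logger.Level.DEBUG, "extracted {0} in {1} ms", entryName, millis);
    }

    /** For expensive messages, a supplier defers building the text until it is needed. */
    static void summary(int fileCount) {
        LOG.log(System.Logger.Level.INFO, () -> "extracted " + fileCount + " files");
    }
}
```

Keeping per-file messages at DEBUG level means that a hot loop over thousands of small files pays little for logging when that level is switched off.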
