L. Angelino

Reputation: 43

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar archive into different folders, as quickly as possible. This is the code:

public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}

When I execute this operation once, it takes ~900 ms. But when I do something like the following to execute the same operation multiple times in parallel, it takes ~20000 ms:

ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}

or

File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}


Upvotes: 0

Views: 277

Answers (2)

Joop Eggen

Reputation: 109593

  • One suspicion is that the directories contain many files, so File.mkdirs performs needlessly many checks.
  • The BufferedInputStream constructor can take a custom buffer size. In my experience that has never helped much, but it might with your disk. With parallelism it could also help to reduce "disk head movement."
  • You probably already tried Files.copy, but it might still have better memory behavior than readAllBytes.

So the version becomes (eschewing File in favor of Path):

public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = Files.newInputStream(archivePath);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
             new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            // only create directories when the parent changes, avoiding repeated checks
            if (!fileParent.equals(oldFileParent)) {
                 oldFileParent = fileParent;
                 Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}

Declaring throws RuntimeException while catching the IOException/ArchiveException and not rethrowing it (for instance as new IllegalStateException(e)) is a matter of taste.
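If one prefers propagating, the catch block of the version above could instead rethrow, for example:

} catch (IOException | ArchiveException e) {
    // rethrow as unchecked so callers (and the Futures returned by submit)
    // actually see the failure instead of only a printed stack trace
    throw new IllegalStateException("Extraction of " + archive + " failed", e);
}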

Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means the disk head has to skip back and forth between them. Small files might still be fine.

A better approach seems to be to parallelize reading the next file and writing it in another thread. Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so the writing thread does not touch is. A sketch of this idea follows below.
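A minimal sketch of that idea, using a bounded queue between one reading and one writing thread (the class, the record, the queue capacity and the poison pill are my own illustrative choices, not from the question):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class PipelinedExtractor {

    // One thread reads tar entries into memory, the other writes them to disk.
    private record Item(String name, byte[] data) {}

    private static final Item POISON = new Item("", new byte[0]);

    public void decompressPipelined(File archive, Path destination)
            throws IOException, InterruptedException {
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(64);

        Thread writer = new Thread(() -> {
            try {
                for (Item item = queue.take(); item != POISON; item = queue.take()) {
                    Path file = destination.resolve(item.name());
                    Files.createDirectories(file.getParent());
                    Files.write(file, item.data());
                }
            } catch (IOException | InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
        writer.start();

        try (InputStream in = Files.newInputStream(archive.toPath());
             TarArchiveInputStream tar = new TarArchiveInputStream(new BufferedInputStream(in))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isDirectory()) {
                    // readAllBytes stops at the end of the current entry,
                    // so the writer thread never has to touch the tar stream
                    queue.put(new Item(entry.getName(), tar.readAllBytes()));
                }
            }
        }
        queue.put(POISON);   // signal the writer that nothing more will come
        writer.join();
    }
}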

The tar entry also stores the file size, which would allow checking whether readAllBytes is efficient enough, especially for large files.
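TarArchiveEntry exposes that size from the tar header, so inside the extraction loop above one could, for example, only buffer small entries in memory (the threshold here is an arbitrary illustration):

long size = entry.getSize();                  // size as recorded in the tar header
if (size >= 0 && size <= 8 * 1024 * 1024) {
    // small entry: buffer it in memory and hand it to the writing thread
    byte[] data = is.readAllBytes();
    // ... enqueue data for the writer ...
} else {
    // large entry: stream it directly instead of holding it all in memory
    Files.copy(is, file);
}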

Logging was mentioned in this question. It is known that it can consume much time, and with parallelism it becomes even more critical. But you seem to be aware of that; you wrote that you have written your own logger. For a library, System.Logger is actually best: it is a façade that uses whatever logger the application provides. This would also have avoided the logger vulnerability hidden in library dependencies of the past year.
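For illustration, a minimal use of the System.Logger façade (the class and logger name are just an example):

import java.lang.System.Logger;
import java.lang.System.Logger.Level;

public class TarExtractor {
    // Resolved through the JDK's LoggerFinder, so whatever logging backend the
    // application provides (JUL, Log4j, ...) is picked up without a compile-time dependency.
    private static final Logger LOG = System.getLogger(TarExtractor.class.getName());

    void logEntry(String entryName) {
        LOG.log(Level.DEBUG, "extracted entry {0}", entryName);
    }
}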

Upvotes: 1

vanxa

Reputation: 21

Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads that each decompress the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what is the Logger you are using there? While other parts of your code don't seem to be shared among multiple threads, the static call to the Logger is. Also note: java.nio uses FileChannels, which provide synchronous I/O, so depending on how you create the channels you may run into similar situations (though I don't believe that applies here).

Upvotes: 0
