Reputation: 43
I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar into different folders in a very short time. This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
When I run this operation once, it takes ~900 ms. But when I run the same operation several times in parallel, as below, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}
Upvotes: 0
Views: 277
Reputation: 109593
File.mkdirs does needlessly many checks.
BufferedInputStream may be given a custom buffer size. That never helped much, but it might with your disk. With parallelism it could also help to reduce "disk head movements."
Files.copy might still have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = Files.newInputStream(archivePath);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        // Remember the last directory created, so that consecutive entries
        // in the same directory do not trigger another createDirectories call.
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            // Streams the current entry to disk; REPLACE_EXISTING keeps the
            // overwrite behavior that Files.write had.
            Files.copy(is, file, StandardCopyOption.REPLACE_EXISTING);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Declaring throws RuntimeException and catching the IOException/ArchiveException without rethrowing it (for instance as new IllegalStateException(e)) is a matter of taste.
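For illustration, rethrowing could look roughly like the sketch below. The class Rethrow and the method readOrFail are hypothetical names, and a plain file read stands in for the tar extraction; only the wrapping in IllegalStateException comes from the remark above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, for illustration only. */
final class Rethrow {

    /** Reads a file and converts the checked IOException into an unchecked one. */
    static byte[] readOrFail(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            // Keep the original exception as the cause so the stack trace is not lost.
            throw new IllegalStateException("cannot read " + file, e);
        }
    }
}
```

The caller then sees the failure instead of a silently printed stack trace, which matters once the work runs on a pool thread.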
Now for parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk, though with small files that might just work.
It seems better to parallelize differently: read the next file in one thread and write it in another.
Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so that the writing thread does not need to use is.
The tar entry probably stores the file size too, which would allow checking whether readAllBytes is efficient enough for large files.
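A minimal sketch of that reader/writer split, under some assumptions: the class ReadWritePipeline, the Item holder and the queue size of 64 are made up for illustration, and the tar stream is replaced by a plain iterator of name/content pairs so that the example needs no library. In the real method the calling thread would put destinationPath.resolve(entry.getName()) and is.readAllBytes() into the queue for each tar entry.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: the calling thread reads, one extra thread writes. */
final class ReadWritePipeline {

    /** One file that has been read fully into memory. */
    private static final class Item {
        final Path target;
        final byte[] content;
        Item(Path target, byte[] content) { this.target = target; this.content = content; }
    }

    /** Sentinel telling the writer that no more items will arrive. */
    private static final Item END = new Item(null, null);

    static void extract(Iterator<Map.Entry<String, byte[]>> entries, Path destination)
            throws IOException, InterruptedException {
        // Bounded queue: the reader blocks when the writer falls behind, capping memory use.
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(64);
        ExecutorService writer = Executors.newSingleThreadExecutor();
        Future<Void> done = writer.submit(() -> {
            IOException failure = null;
            for (Item item = queue.take(); item != END; item = queue.take()) {
                if (failure != null) continue; // keep draining so the reader never blocks
                try {
                    Files.createDirectories(item.target.getParent());
                    Files.write(item.target, item.content);
                } catch (IOException e) {
                    failure = e;
                }
            }
            if (failure != null) throw failure;
            return null;
        });
        writer.shutdown(); // no further tasks; the submitted one still runs to completion
        try {
            while (entries.hasNext()) {
                Map.Entry<String, byte[]> entry = entries.next();
                queue.put(new Item(destination.resolve(entry.getKey()), entry.getValue()));
            }
        } finally {
            queue.put(END);
        }
        try {
            done.get(); // wait for the writer and surface its first failure
        } catch (ExecutionException e) {
            throw new IOException("writing failed", e.getCause());
        }
    }
}
```

Whether this beats the single-threaded version depends on the disk; it only overlaps reading with writing and does not add more concurrent writers.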
Logging was mentioned in this question. It is known that it can consume much time, and with parallelism that becomes even more critical. But you seem to be aware of it; you wrote that you have written your own logger. For a library, System.Logger is actually best. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies of the past year.
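A small sketch of using System.Logger (available since Java 9); the class ExtractionLog, the logger name "extraction" and the message text are assumptions made for illustration.

```java
import java.lang.System.Logger;
import java.lang.System.Logger.Level;

/** Hypothetical logging helper built on the JDK's System.Logger façade. */
final class ExtractionLog {

    // The application decides which backend handles this; java.util.logging is the default.
    static final Logger LOG = System.getLogger("extraction");

    /** Logs one written entry; the message is only formatted if DEBUG is enabled. */
    static void entryWritten(String entryName, long millis) {
        LOG.log(Level.DEBUG, "wrote {0} in {1} ms", entryName, millis);
    }
}
```

Passing the values as parameters instead of concatenating strings avoids building the message when the level is disabled, which helps when many threads log per file.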
Upvotes: 1
Reputation: 21
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what is the Logger you are using there? While other parts of your code don't seem to be shared among multiple threads, the static call to Logger is something that is shared.
Also note: java.nio uses FileChannels which provide synchronous I/O, so depending on how you create the channels, you may get into similar situations (though I don't believe this applies here).
Upvotes: 0