Reputation: 4308
A client is supposed to upload a compressed file into an S3 folder. Then the compressed file is downloaded and decompressed to perform various operations on its contained files. Originally we told our client to compress its files into a ZIP file, but this proved too difficult for our client. Instead it submitted a RAR file with ZIP extension... how clever. For obvious reasons one can't decompress a RAR file using a ZIP decompressing algorithm.
So, I'm looking for a way to find out the file type of the S3 downloaded files given that I'm working on a Java project with Amazon's SDK on a Linux OS. I'll take care of how to decompress the file depending on the obtained file type.
I've looked at many Stack Overflow questions, like this one, but judging by the answers and their comments, none seems 100% effective.
What would be the best approach to find out the compressed file's type?
Upvotes: 1
Views: 4698
Reputation: 4308
When one uploads a file to Amazon S3 programmatically, one can specify the object's Content-Type. If none is specified, as @Michael-bot clarifies, the value assigned by default is binary/octet-stream. If one uploads the file through Amazon S3's GUI instead, the file gets its Content-Type from its file extension (sadly, not from its contents). If you can trust whoever uploaded the file to set the Content-Type correctly, go ahead and look at the ObjectMetadata, but if you can't (like me), you need another solution.
So, if you are looking for a solution that works on the most common file compression types, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable solutions.
In the end I chose Files.probeContentType, as it required no extra libraries and works just fine on a Linux machine (as long as the file doesn't have the wrong extension, for which there is a workaround: remove the file extension and let it do its magic).
At first one would think that the response object when downloading the file from Amazon's S3 includes the file type. And it does contain this information, but the problem arises when the extension of the file doesn't match its contents.
import com.amazonaws.services.s3.model.S3Object;
final S3Object s3Object = ...;
final String contentType = s3Object.getObjectMetadata().getContentType();
This code would return application/zip even if the file actually contains a RAR archive, so this solution doesn't work for me.
For this reason I took the time to build a sample project that tested various scenarios with the different approaches and libraries available. I'm using Java 8 by the way.
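The file types tested are Rar, Zip, 7z, Tar.xz and Tar.gz, each with its correct extension, with a Zip extension, and without any extension.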
Beware, the implementations presented here are only for testing purposes. They are not in any way endorsed to be used in production code, as they don't consider file locking problems among other things that my imagination couldn't bother to consider. =)
import java.io.File;
import javax.activation.MimetypesFileTypeMap;
final File file = new File(basePath + "/" + fileName);
try {
return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: application/octet-stream
Rar with Zip extension is: application/octet-stream
Zip with Zip extension is: application/octet-stream
7z with 7z extension is: application/octet-stream
7z with Zip extension is: application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is: application/octet-stream
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is: application/octet-stream
Rar without extension is: application/octet-stream
Zip without extension is: application/octet-stream
7z without extension is: application/octet-stream
Tar.xz without extension is: application/octet-stream
Tar.gz without extension is: application/octet-stream
The value returned by this approach when a file type has not been recognized is application/octet-stream. All scenarios failed, so this approach can be discarded.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.net.URLConnection;
final File file = new File(basePath + "/" + fileName);
// try-with-resources closes the stream; the BufferedInputStream supplies the
// mark/reset support that guessContentTypeFromStream needs.
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
return URLConnection.guessContentTypeFromStream(inputStream);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: null
Rar with Zip extension is: null
Zip with Zip extension is: null
7z with 7z extension is: null
7z with Zip extension is: null
Tar.xz with Tar.xz extension is: null
Tar.xz with Zip extension is: null
Tar.gz with Tar.gz extension is: null
Tar.gz with Zip extension is: null
Rar without extension is: null
Zip without extension is: null
7z without extension is: null
Tar.xz without extension is: null
Tar.gz without extension is: null
Again, this method fails all scenarios. It seems its support is very limited.
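guessContentTypeFromStream decides from the first bytes of the stream, but it evidently does not know the signatures of these archive formats. As a rough illustration of the same idea, here is a minimal sketch with the well-known leading bytes of the tested formats hard-coded. The class, enum and method names are hypothetical (they are not part of any library), and the signature list is deliberately incomplete: for example, an empty ZIP archive starts with different bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class ArchiveSniffer {

    enum ArchiveType { ZIP, RAR, SEVEN_ZIP, GZIP, XZ, UNKNOWN }

    // Leading signature ("magic") bytes of each format.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};             // "PK\3\4"
    private static final int[] RAR = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07}; // "Rar!" 1A 07
    private static final int[] SEVEN_ZIP = {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C};
    private static final int[] GZIP = {0x1F, 0x8B};
    private static final int[] XZ = {0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00};

    // Classifies a buffer holding the first bytes of a file.
    static ArchiveType sniff(final byte[] header) {
        if (startsWith(header, ZIP)) return ArchiveType.ZIP;
        if (startsWith(header, RAR)) return ArchiveType.RAR;
        if (startsWith(header, SEVEN_ZIP)) return ArchiveType.SEVEN_ZIP;
        if (startsWith(header, GZIP)) return ArchiveType.GZIP;
        if (startsWith(header, XZ)) return ArchiveType.XZ;
        return ArchiveType.UNKNOWN;
    }

    // Reads at most 8 bytes from the file and classifies them; the file name is ignored.
    static ArchiveType sniff(final Path path) throws IOException {
        final byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int read;
            while (total < header.length
                    && (read = in.read(header, total, header.length - total)) != -1) {
                total += read;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(final byte[] data, final int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Like the detectors in this answer that inspect contents, this sees a Tar.gz only as gzip and a Tar.xz only as xz, because the tar layer is hidden inside the compressed stream.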
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
try {
final Path path = Paths.get(basePath + "/" + fileName);
return Files.probeContentType(path);
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: application/vnd.rar
Rar with Zip extension is: application/zip
Zip with Zip extension is: application/zip
7z with 7z extension is: application/x-7z-compressed
7z with Zip extension is: application/zip
Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
Tar.xz with Zip extension is: application/zip
Tar.gz with Tar.gz extension is: application/x-compressed-tar
Tar.gz with Zip extension is: application/zip
Rar without extension is: application/vnd.rar
Zip without extension is: application/zip
7z without extension is: application/x-7z-compressed
Tar.xz without extension is: application/x-xz
Tar.gz without extension is: application/gzip
This method worked surprisingly well, but don't be fooled: there is a scenario where it consistently fails. If a file has the wrong extension (one that doesn't match its content), it reports the type implied by the extension. That should not happen very often, but if one needs to be strict, this method is not to be used.
Also, some warn that this approach doesn't work well on Windows.
Workaround: If one manages to remove the extension from the filename, this would return the proper value for all the given scenarios.
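A minimal sketch of that workaround, assuming it is acceptable to copy the file: it probes an extensionless temporary copy, so the original name cannot influence the result. The class and method names are hypothetical. Whether the copy is identified at all depends on the file type detectors available to the JVM; the results above come from Java 8 on Linux, and other JDK versions or platforms may simply return null for a file without an extension.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class ExtensionlessProbe {

    /**
     * Probes an extensionless temporary copy of the given file.
     * Returns null when the installed detectors cannot identify the content.
     */
    static String probeIgnoringExtension(final Path original) throws IOException {
        // An empty suffix gives the temporary file a name without any extension.
        final Path copy = Files.createTempFile("probe", "");
        try {
            Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
            return Files.probeContentType(copy);
        } finally {
            // Always remove the temporary copy, even if probing fails.
            Files.deleteIfExists(copy);
        }
    }
}
```

For large archives, renaming the downloaded file (or downloading it to an extensionless path in the first place) would avoid the cost of the copy.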
There seem to be many flavors of Apache Tika (app, server, eval, etc.), but many around the web complain about it being somewhat "dependency-heavy".
import java.io.File;
import org.apache.tika.Tika;
try {
return new Tika().detect(new File(basePath + "/" + fileName));
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: application/x-rar-compressed
Rar with Zip extension is: application/x-rar-compressed
Zip with Zip extension is: application/zip
7z with 7z extension is: application/x-7z-compressed
7z with Zip extension is: application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is: application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is: application/gzip
Rar without extension is: application/x-rar-compressed
Zip without extension is: application/zip
7z without extension is: application/x-7z-compressed
Tar.xz without extension is: application/x-xz
Tar.gz without extension is: application/gzip
All files were properly identified, but the library has its disadvantages as well as its advantages.
Pros:
Cons:
import java.net.URL;
import java.net.URLConnection;
try {
final URL url = new URL("file://" + basePath + "/" + fileName);
final URLConnection urlConnection = url.openConnection();
return urlConnection.getContentType();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: content/unknown
Rar with Zip extension is: application/zip
Zip with Zip extension is: application/zip
7z with 7z extension is: content/unknown
7z with Zip extension is: application/zip
Tar.xz with Tar.xz extension is: content/unknown
Tar.xz with Zip extension is: application/zip
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is: application/zip
Rar without extension is: content/unknown
Zip without extension is: content/unknown
7z without extension is: content/unknown
Tar.xz without extension is: content/unknown
Tar.gz without extension is: content/unknown
It identifies hardly any compression format, and it goes by the file's extension rather than its contents.
The SimpleMagic project seems to be updated at least once a year.
import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;
try {
final ContentInfoUtil util = new ContentInfoUtil();
final ContentInfo info = util.findMatch(basePath + "/" + fileName);
return info.getMimeType();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: application/x-rar
Rar with Zip extension is: application/x-rar
Zip with Zip extension is: application/zip
7z with 7z extension is: application/x-7z-compressed
7z with Zip extension is: application/x-7z-compressed
Tar.xz with Tar.xz extension is: <EXCEPTION: null>
Tar.xz with Zip extension is: <EXCEPTION: null>
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is: application/x-gzip
Rar without extension is: application/x-rar
Zip without extension is: application/zip
7z without extension is: application/x-7z-compressed
Tar.xz without extension is: <EXCEPTION: null>
Tar.gz without extension is: application/x-gzip
It worked for almost all our scenarios, but it failed to detect the more "obscure" compression formats such as Tar.xz. The <EXCEPTION: null> is most likely a NullPointerException: findMatch returns null when nothing matches, and the test code calls getMimeType() on the result without checking.
The mime-util project has not been modified since 2010, so don't expect support or updates. It is listed here only for the sake of completeness.
import eu.medsea.mimeutil.MimeUtil2;
try {
final MimeUtil2 mimeUtil = new MimeUtil2();
mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}
Rar with Rar extension is: application/x-rar
Rar with Zip extension is: application/x-rar
Zip with Zip extension is: application/zip
7z with 7z extension is: application/octet-stream
7z with Zip extension is: application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is: application/octet-stream
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is: application/x-gzip
Rar without extension is: application/x-rar
Zip without extension is: application/zip
7z without extension is: application/octet-stream
Tar.xz without extension is: application/octet-stream
Tar.gz without extension is: application/x-gzip
It identifies some of the most popular file types, but fails with Tar.xz and 7z.
Not the prettiest solution, but it had to be tried: Ubuntu file command.
import java.io.BufferedReader;
import java.io.InputStreamReader;
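try {
// Pass the arguments as an array so a path containing spaces is not split.
final Process process = Runtime.getRuntime().exec(new String[] {"file", "--mime-type", basePath + "/" + fileName});
final StringBuilder text = new StringBuilder();
// try-with-resources closes the reader once the output has been consumed.
try (BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
String s;
while ((s = stdInput.readLine()) != null) {
text.append(s);
}
}
// The output has the form "<path>: <mime type>".
return text.toString().split(": ")[1];
} catch (final Exception exception) {
return "<EXCEPTION: " + exception.getMessage() + ">";
}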
Rar with Rar extension is: application/x-rar
Rar with Zip extension is: application/x-rar
Zip with Zip extension is: application/zip
7z with 7z extension is: application/x-7z-compressed
7z with Zip extension is: application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is: application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is: application/gzip
Rar without extension is: application/x-rar
Zip without extension is: application/zip
7z without extension is: application/x-7z-compressed
Tar.xz without extension is: application/x-xz
Tar.gz without extension is: application/gzip
It works for all our scenarios, but again, this relies on the file command being present on the system running the code.
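One practical detail in the results above is that the approaches disagree on the MIME string for the same format (for instance application/vnd.rar, application/x-rar and application/x-rar-compressed all mean RAR). Before choosing a decompressor it may help to normalize them. The following is only a sketch: the class and enum names are hypothetical, and the table contains just the strings that appear in the results listed in this answer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class ArchiveMimeTypes {

    enum Format { ZIP, RAR, SEVEN_ZIP, GZIP, XZ, UNKNOWN }

    private static final Map<String, Format> BY_MIME = new HashMap<>();

    static {
        BY_MIME.put("application/zip", Format.ZIP);
        BY_MIME.put("application/vnd.rar", Format.RAR);
        BY_MIME.put("application/x-rar", Format.RAR);
        BY_MIME.put("application/x-rar-compressed", Format.RAR);
        BY_MIME.put("application/x-7z-compressed", Format.SEVEN_ZIP);
        BY_MIME.put("application/gzip", Format.GZIP);
        BY_MIME.put("application/x-gzip", Format.GZIP);
        BY_MIME.put("application/x-compressed-tar", Format.GZIP);
        BY_MIME.put("application/x-xz", Format.XZ);
        BY_MIME.put("application/x-xz-compressed-tar", Format.XZ);
    }

    // Maps a detector's MIME string to a format; anything unlisted
    // (null, application/octet-stream, content/unknown, ...) is UNKNOWN.
    static Format fromMimeType(final String mimeType) {
        if (mimeType == null) {
            return Format.UNKNOWN;
        }
        final String key = mimeType.trim().toLowerCase(Locale.ROOT);
        return BY_MIME.getOrDefault(key, Format.UNKNOWN);
    }
}
```

GZIP and XZ here describe only the outer compression layer; for a Tar.gz or Tar.xz the tar archive still has to be unpacked after decompressing.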
Upvotes: 5