James Raitsev

Reputation: 96391

How to check whether the file is binary?

I wrote the following method to check whether a particular file contains only ASCII text characters or also contains control characters. Could you glance at this code, suggest improvements, and point out oversights?

The logic is as follows: "If the first 500 bytes of a file contain 5 or more control characters, report it as a binary file."

Thank you.

public boolean isAsciiText(String fileName) throws IOException {

    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];

    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;

    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {

            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}

Upvotes: 6

Views: 10946

Answers (7)

fozzybear

Reputation: 147

One could parse the file and compare it against a list of known binary file header bytes, like the one provided here (and backed up here).

The problem is that one needs a sorted list of binary-only headers, and the list may well be incomplete, for example when reading and parsing binary files contained in some Equinox framework jar. If one needs to identify specific file types, though, this should work.

If you're on Linux, running the native file command should work well for files that exist on disk:

String command = "file -i [ZIP FILE...]";
Process process = Runtime.getRuntime().exec(command);
...

It will output information on the files:

...: application/zip; charset=binary

which you can filter further with grep or in Java, depending on whether you simply need an estimate of whether the files are binary or need to find out their MIME types.
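As a rough sketch of the "filter in Java" option, the snippet below runs `file -i` through `ProcessBuilder` and checks whether the reported charset is `binary`. The class and method names (`FileCommandCheck`, `reportsBinary`, `isBinaryAccordingToFile`) are hypothetical, and it assumes a Linux system where the `file` command is on the PATH and prints output in the shape shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class FileCommandCheck {

    // Hypothetical helper: true if one line of `file -i` output ends with "charset=binary".
    static boolean reportsBinary(String fileOutputLine) {
        return fileOutputLine.trim().endsWith("charset=binary");
    }

    // Runs `file -i <path>` and inspects the first line of its output.
    // Assumes the `file` command is installed (typical on Linux).
    static boolean isBinaryAccordingToFile(String path) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("file", "-i", path)
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            process.waitFor();
            return line != null && reportsBinary(line);
        }
    }
}
```

Passing the path as a separate argument avoids building a single command string, so file names containing spaces do not need quoting.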

When parsing InputStreams, such as the content of files nested inside archives, this unfortunately doesn't work unless you resort to shell-only programs like unzip, assuming you want to avoid creating temporary unzipped files.

For this, a rough estimate based on examining the first 500 bytes has worked well enough for me so far, as hinted at in the other answers. Instead of Character.isWhitespace/isISOControl(char), I used Character.isIdentifierIgnorable(codePoint), assuming UTF-8 is the default encoding:

private static boolean isBinaryFileHeader(byte[] headerBytes) {
    return new String(headerBytes).codePoints().filter(Character::isIdentifierIgnorable).count() >= 5;
}

public void printNestedZipContent(String zipPath) {
    try (ZipFile zipFile = new ZipFile(zipPath)) {
        int zipHeaderBytesLen = 500;
        zipFile.entries().asIterator().forEachRemaining( entry -> {
            String entryName = entry.getName();
            if (entry.isDirectory()) {
                System.out.println("FOLDER_NAME: " + entryName);
                return;
            }
            // Get content bytes from ZipFile for ZipEntry 
            try (InputStream zipEntryStream = new BufferedInputStream(zipFile.getInputStream(entry))) {
                // read and store header bytes
                byte[] headerBytes = zipEntryStream.readNBytes(zipHeaderBytesLen);
                // Skip entry, if nested binary file
                if (isBinaryFileHeader(headerBytes)) {
                    return;
                }
                // Continue reading zipInputStream bytes, if non-binary
                byte[] zipContentBytes = zipEntryStream.readAllBytes();
                int zipContentBytesLen = zipContentBytes.length;
                // Join the already read header bytes (first) and the rest of the content bytes.
                // headerBytes.length is used because short entries yield fewer than zipHeaderBytesLen bytes.
                byte[] joinedZipEntryContent = Arrays.copyOf(headerBytes, headerBytes.length + zipContentBytesLen);
                System.arraycopy(zipContentBytes, 0, joinedZipEntryContent, headerBytes.length, zipContentBytesLen);
                // Output (default/UTF-8) encoded text file content
                System.out.println(new String(joinedZipEntryContent));
            } catch (IOException e) {
                System.out.println("ERROR getting ZipEntry content: " + entry.getName());
            }
        });
    } catch (IOException e) {
        System.out.println("ERROR opening ZipFile: " + zipPath);
        e.printStackTrace();
    }
}

Upvotes: 0

Pointy

Reputation: 413720

Since you call this method "isAsciiText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more precise with:

if (thisByte < 32 || thisByte > 127) bin++;

Edit, a long time later: it was pointed out in a comment that this simple check would be tripped up by a text file that starts with a lot of newlines. It would probably be better to use a table of "ok" bytes that includes the printable characters plus carriage return, newline, and tab (and possibly form feed, though I don't think many modern documents use it), and then check each byte against the table.
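A minimal sketch of such a table, under the assumption that "ok" means printable ASCII (0x20 to 0x7E) plus tab, newline, carriage return, and form feed. The names `OkByteTable` and `countSuspicious` are made up for illustration.

```java
final class OkByteTable {

    // Lookup table indexed by the unsigned byte value; true means "acceptable in ASCII text".
    private static final boolean[] OK = new boolean[256];

    static {
        for (int b = 0x20; b <= 0x7E; b++) {
            OK[b] = true; // printable ASCII, space through tilde
        }
        OK['\t'] = true;
        OK['\n'] = true;
        OK['\r'] = true;
        OK['\f'] = true; // form feed: optional, included here as an assumption
    }

    // Counts bytes in data[0..length) that are not in the table.
    static int countSuspicious(byte[] data, int length) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            // Java bytes are signed, so mask to get an index in 0..255.
            if (!OK[data[i] & 0xFF]) {
                count++;
            }
        }
        return count;
    }
}
```

The `& 0xFF` mask matters because a Java `byte` ranges from -128 to 127; without it, bytes above 0x7F would produce negative array indexes.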

Upvotes: 3

leonbloy

Reputation: 75906

  1. Fails badly if file size is less than 500 bytes

  2. The line char it = (char) thisByte; is conceptually dubious: it mixes the concepts of bytes and chars, i.e. it implicitly assumes an encoding in which one byte equals one character (and thus excludes multi-byte Unicode encodings). In particular, it fails if the file is UTF-16 encoded.

  3. The return inside the loop (slightly bad practice IMO) forgets to close the file.
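To illustrate the UTF-16 point, here is a small sketch that applies the same per-byte test as the question's loop to a short string encoded as UTF-16LE. The class and method names are hypothetical. Each ASCII character becomes two bytes in UTF-16, one of which is zero, so even plain text accumulates "control characters" quickly.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Demo {

    // Same per-byte test as in the question: non-whitespace ISO control characters are counted.
    static int countControlBytes(byte[] bytes) {
        int count = 0;
        for (byte b : bytes) {
            char c = (char) b;
            if (!Character.isWhitespace(c) && Character.isISOControl(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello, world";
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16LE);
        System.out.println("ASCII control bytes:  " + countControlBytes(ascii));
        System.out.println("UTF-16 control bytes: " + countControlBytes(utf16));
    }
}
```

With the threshold of 5 from the question, the UTF-16 version of this 12-character string would already be reported as binary.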

Upvotes: 3

Nikolaus Gradwohl

Reputation: 20124

This would not work with the JDK install packages for Linux or Solaris. They start with a shell script, followed by a binary data blob.

Why not check the MIME type using a library like jMimeMagic (http://sourceforge.net/projects/jmimemagic/) and decide how to handle the file based on the MIME type?

Upvotes: 0

unbeli

Reputation: 30228

  1. You ignore what read() returns. What if the file is shorter than 500 bytes?
  2. When you return false, you don't close the file.
  3. When converting byte to char, you assume your file is 7-bit ASCII.
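The first two points could be addressed roughly as follows. This is a sketch, not a drop-in replacement: `HeaderReader` and `readHeader` are hypothetical names, and `InputStream.readNBytes(int)` requires Java 11 or later. The returned array is only as long as the data actually read, and try-with-resources closes the stream on every exit path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeaderReader {

    // Reads at most maxBytes from the start of the file.
    // The result is shorter than maxBytes when the file is shorter,
    // so callers never inspect zero padding that was not in the file.
    static byte[] readHeader(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return in.readNBytes(maxBytes);
        } // the stream is closed here, whether we return normally or throw
    }
}
```

A caller would then loop over the returned array instead of a fixed 500-byte buffer.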

Upvotes: 0

Andrzej Doyle

Reputation: 103787

The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.

Aside from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation, in my opinion, would be to invert the check: write an isAsciiText function that asserts that all the characters in the file (or in the first 500 bytes, if you so wish) are in a set of bytes that are known to be good.

Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.

Upvotes: 1

Dave

Reputation: 5173

x doesn't appear to do anything.

What if the file is less than 500 bytes?

Some binary files begin with a header of N bytes containing data that is useful to an application but that the library the binary is meant for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this, followed by a gigabyte of binary data.

The method should handle the exceptions thrown if the file can't be opened or read, etc.

Upvotes: 3
