gabor

Reputation: 1030

How to tell binary from text files in Linux

The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is also able to tell binary files from text files, and produces different output for each.

Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.

To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

Upvotes: 13

Views: 21237

Answers (8)

Robin A. Meade

Reputation: 2474

This approach defers to the grep command in determining whether a file is binary or text:

is_text_file() { grep -qIF '' "$1"; }

grep options used:

  • -q Quiet; Exit immediately with zero status if any match is found
  • -I Process a binary file as if it did not contain matching data
  • -F Interpret PATTERNS as fixed strings, not regular expressions.

grep pattern used:

  • '' Empty string. All files (except an empty file) will match this pattern.

Notes

  • An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
  • A file with one printable character, say a, is considered a text file according to this test. That makes sense to me, although the file command disagrees with this assessment (tested with GNU file).
  • This approach requires only one child process to test whether a file is text or binary.

Test

# cd into a temp directory
cd "$(mktemp -d)"

# Create 3 corner-case test files
touch empty_file       # An empty file
echo -n a >one_byte_a  # A file containing just `a`
echo a >one_line_a     # A file containing just `a` and a newline

# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB

# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1

# Defer to grep to determine if a file is a text file
# (same options and empty pattern as described above)
is_text_file() { grep -qIF '' "$1"; }

# Test harness
do_test() {
  printf '%22s ... ' "$1"
  if is_text_file "$1"; then
    echo "is a text file"
  else
    echo "is a binary file"
  fi
}

# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1

Output

            empty_file ... is a binary file
            one_byte_a ... is a text file
            one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file

On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep). The exact crossover point depends on your machine's page size.

Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550
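
As a usage sketch, the same test can be applied to every regular file in a directory. The helper name list_text_files is hypothetical and not part of the answer above; it assumes a grep that supports -I (GNU grep does).

```shell
# Same test as above: defer to grep to decide whether a file is text
is_text_file() { grep -qIF '' "$1"; }

# Hypothetical helper: print the regular files directly inside
# directory $1 (default: current directory) that grep treats as text.
list_text_files() {
  for f in "${1:-.}"/*; do
    if [ -f "$f" ] && is_text_file "$f"; then
      printf '%s\n' "$f"
    fi
  done
  return 0
}
```

Because every path is prefixed with the directory name, a file whose name starts with a dash is not mistaken for a grep option. Empty files are skipped, in line with the notes above.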

Upvotes: 4

yoshi

Reputation: 413

A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The Type column will show you whether a file is text or binary.

Upvotes: 0

RichieHindle

Reputation: 281875

A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.

Update: According to the diff manual, this is exactly what diff does.
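
A minimal sketch of that NUL check in shell, assuming a head that supports -c (GNU and BSD both do). The function name looks_like_text and the 2048-byte default are illustrative choices, not anything diff itself exposes.

```shell
# Hypothetical helper: succeed (exit 0) if the first $2 bytes of file $1
# (default 2048) contain no NUL byte; return 2 if the file is unreadable.
looks_like_text() {
  [ -r "$1" ] || return 2
  n=${2:-2048}
  # Count the bytes in the prefix, then count again with NULs removed.
  total=$(head -c "$n" -- "$1" | wc -c)
  kept=$(head -c "$n" -- "$1" | tr -d '\000' | wc -c)
  # Equal counts mean no NUL byte was deleted, so none was present.
  [ "$total" -eq "$kept" ]
}
```

Note that an empty file passes this check, since it contains no NUL at all.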

Upvotes: 6

Raghu

Reputation: 1

Commands like less and grep detect it quite easily (and fast). You can have a look at their source.

Upvotes: -1

These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.

See here for how Subversion does it.

Upvotes: 1

David Schmitt

Reputation: 59375

The diff manual specifies that

diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
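
Since the question is really "will diff attempt a text-based comparison?", one option is to ask diff directly by comparing the file against /dev/null. This is only a sketch: the helper name diff_says_binary is hypothetical, and it assumes diff reports binary input with a line starting with "Binary files", as GNU diffutils does in the C locale.

```shell
# Hypothetical helper: succeed (exit 0) if diff treats file $1 as binary.
# Assumption: diff prints "Binary files A and B differ" for binary input
# (the GNU diffutils message in the C locale).
diff_says_binary() {
  LC_ALL=C diff -- /dev/null "$1" 2>/dev/null | grep -q '^Binary files'
}
```

Lines from a text file appear in diff output prefixed with "> ", so a text file that happens to contain the words "Binary files" does not trigger a false match. An empty file is identical to /dev/null, so diff prints nothing and the helper reports it as not binary.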

Upvotes: 7

Tyler McHenry

Reputation: 76770

file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".

If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
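
A sketch of turning that rule into a yes/no test. The helper name file_says_text is made up for illustration, and the result is only as good as the heuristics of whichever file version is installed.

```shell
# Hypothetical helper: succeed (exit 0) if the description printed by
# file(1) for $1 contains the word "text".
file_says_text() {
  # -b (brief) omits the file name from the output, so a path that
  # happens to contain "text" cannot cause a false positive.
  file -b -- "$1" | grep -q 'text'
}
```

Usage: file_says_text somefile && echo text || echo binary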

Upvotes: 13

Simone Margaritelli

Reputation: 4733

You could try to give a

strings yourfile

command and compare the size of its output with the file size. I'm not totally sure, but if they are the same, the file is really a text file.

Upvotes: 3
