Reputation: 1030
The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool is able to tell binary files from text files, producing a different output.
Is there a way to tell binary files from text files? All I want is a yes/no answer whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.
To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.
Upvotes: 13
Views: 21237
Reputation: 2474
This approach defers to the grep command in determining whether a file is binary or text:
is_text_file() { grep -qIF '' "$1"; }
Explanation of the options:
-q: Quiet; exit immediately with zero status if any match is found.
-I: Process a binary file as if it did not contain matching data.
-F: Interpret PATTERNS as fixed strings, not regular expressions.
'': Empty string. All files (except an empty file) will match this pattern.
Corner cases:
An empty file is considered binary according to this test. (The file command agrees with this assessment.)
A file containing only a single byte, such as a, is considered a text file according to this test. (Makes sense to me.) (The file command disagrees with this assessment. Tested with GNU file.)
Test script:
# cd into a temp directory
cd "$(mktemp -d)"
# Create 3 corner-case test files
touch empty_file # An empty file
echo -n a >one_byte_a # A file containing just `a`
echo a >one_line_a # A file containing just `a` and a newline
# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB
# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1
# Defer to grep to determine if a file is a text file
is_text_file() { grep -qI '^' "$1"; }
# Test harness
do_test() {
    printf '%22s ... ' "$1"
    if is_text_file "$1"; then
        echo "is a text file"
    else
        echo "is a binary file"
    fi
}
# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1
Output:
            empty_file ... is a binary file
            one_byte_a ... is a text file
            one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file
On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep.) The exact crossover point depends on your machine's page size.
Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550
Upvotes: 4
Reputation: 413
A fast way to do this in Ubuntu is to use Nautilus in the "list" view. The Type column will show you whether a file is text or binary.
Upvotes: 0
Reputation: 281875
A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.
Update: According to the diff manual, this is exactly what diff does.
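The NUL check described above can be sketched in a few lines of shell. This is only an illustration: the function name looks_like_text and the 8192-byte default are assumptions made for the example, not something diff or any other tool defines.

```shell
# Hypothetical helper: exit 0 if the first $2 bytes (default 8192, an
# arbitrary choice) of file $1 contain no NUL byte, exit 1 if they do,
# and exit 2 if the file cannot be read.
looks_like_text() {
    [ -r "$1" ] || return 2
    limit="${2:-8192}"
    # Take the prefix, delete every byte except NUL, and count what is left.
    nul_count=$(head -c "$limit" -- "$1" | tr -cd '\000' | wc -c)
    # $((...)) normalises any padding that wc may add around the number.
    [ "$((nul_count))" -eq 0 ]
}
```

Usage: looks_like_text some_file && echo text || echo binary. Note that an empty file counts as text under this check, and UTF-16 or UTF-32 text will be reported as binary.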
Upvotes: 6
Reputation: 1
Commands like less and grep detect it quite easily (and fast). You can have a look at their source.
Upvotes: -1
Reputation: 27864
These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.
See here for how Subversion does it.
Upvotes: 1
Reputation: 59375
The diff manual specifies that
diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
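Since the number of bytes diff inspects is system dependent, one option is to ask diff itself instead of reimplementing its check. The following is a rough sketch, assuming a diff that prints a line beginning with "Binary files" when it refuses a text comparison (GNU diff does this); the function name diff_sees_text is made up for the example.

```shell
# Hypothetical helper: exit 0 if diff treats file $1 as text, 1 if it
# reports the file as binary, and 2 if $1 is not a regular file.
diff_sees_text() {
    [ -f "$1" ] || return 2
    # Compare against an empty file. LC_ALL=C keeps the message untranslated.
    # Text diffs prefix file content with "< ", so the pattern cannot match
    # a line that merely contains the words "Binary files".
    if LC_ALL=C diff -- "$1" /dev/null 2>/dev/null | grep -q '^Binary files '; then
        return 1
    fi
    return 0
}
```

An empty file is identical to /dev/null, so diff prints nothing and the sketch reports it as text.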
Upvotes: 7
Reputation: 76770
file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".
If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
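Turning that rule into a yes/no answer could look like the sketch below. The function name file_says_text is an assumption for the example; the -b option (brief mode) keeps the file name out of the output, so a name containing "text" cannot cause a false match.

```shell
# Hypothetical helper: exit 0 if the description printed by file(1) for $1
# contains the word "text", non-zero otherwise.
file_says_text() {
    # -b prints only the description, without the leading "name:" prefix.
    file -b -- "$1" | grep -q 'text'
}
```

Usage: file_says_text some_file && echo text || echo binary. The result is only as good as the heuristics of file, as discussed above.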
Upvotes: 13
Reputation: 4733
You could try running strings yourfile and comparing the size of the output with the file size. I'm not totally sure, but if they are the same, the file is really a text file.
Upvotes: 3