Reputation: 22010
I know that Git somehow automatically detects if a file is binary or text and that .gitattributes
can be used to set this manually if needed. But is there also a way to ask Git how it treats a file?
So let's say I have a Git repository with two files in it: an ascii.dat file containing plain text and a binary.dat file containing random binary stuff. Git handles the first .dat file as text and the second file as binary. Now I want to write a Git web front end which has a viewer for text files and a special viewer for binary files (displaying a hex dump, for example). Sure, I could implement my own text/binary check, but it would be more useful if the viewer relied on the information about how Git handles these files.
So how can I ask Git if it treats a file as text or binary?
Upvotes: 109
Views: 30081
Reputation: 1323263
So how can I ask Git if it treats a file as text or binary?
Not only is git check-attr --all a good option, but with Git 2.40 (Q1 2023), "git check-attr" (man) learned to take an optional tree-ish to read the .gitattributes file from.
That means you can ask Git whether it treats a file as text or binary for any commit, not just the current HEAD!
git check-attr --all --source=@~2 -- myFile
git check-attr --all --source=anotherBranch -- myFile
See commit 47cfc9b, commit c847e8c (14 Jan 2023) by Karthik Nayak (KarthikNayak).
(Merged by Junio C Hamano -- gitster -- in commit 577bff3, 23 Jan 2023)
attr: add flag --source to work with tree-ish
Signed-off-by: Karthik Nayak
Signed-off-by: Toon Claes
Co-authored-by: [email protected]
The contents of the .gitattributes files may evolve over time, but "git check-attr" (man) always checks attributes against them in the working tree and/or in the index.
It may be beneficial to optionally allow the users to check attributes taken from a commit other than HEAD against paths.
Add a new flag --source which will allow users to check the attributes against a commit (actually any tree-ish would do).
When the user uses this flag, we go through the stack of .gitattributes files but instead of checking the current working tree and/or in the index, we check the blobs from the provided tree-ish object.
This allows the command to also be used in bare repositories.
Since we use a tree-ish object, the user can pass "--source HEAD:subdirectory" and all the attributes will be looked up as if subdirectory was the root directory of the repository.
We cannot simply use the <rev>:<path> syntax without the --source flag, similar to how it is used in git show (man) because any non-flag parameter before -- is treated as an attribute and any parameter after -- is treated as a pathname.
The change involves creating a new function read_attr_from_blob, which given the path reads the blob for the path against the provided source and parses the attributes line by line.
This function is plugged into read_attr() function wherein we go through the stack of attributes files.
git check-attr now includes in its man page:
'git check-attr' [--source <tree-ish>] [-a | --all | <attr>...] [--] <pathname>...
'git check-attr' --stdin [-z] [--source <tree-ish>] [-a | --all | <attr>...]
git check-attr now includes in its man page:
--source=<tree-ish>
Check attributes against the specified tree-ish.
It is common to specify the source tree by naming a commit, branch or tag associated with it.
If you are using a sparse-checkout repository though, make sure to use Git 2.43 (Q4 2023), which teaches "git check-attr" (man) to work better with sparse-index.
See commit f981587, commit 4723ae1, commit fd4faf7 (11 Aug 2023) by Shuqi Liang (none).
(Merged by Junio C Hamano -- gitster -- in commit 354356f, 29 Aug 2023)
attr.c: read attributes in a sparse directory
Helped-by: Victoria Dye
Signed-off-by: Shuqi Liang
Before this patch, git check-attr (man) was unable to read the attributes from a .gitattributes file within a sparse directory.
The original comment was operating under the assumption that users are only interested in files or directories inside the cones.
Therefore, in the original code, in the case of a cone-mode sparse-checkout, we didn't load the .gitattributes file.
However, this behavior can lead to missing attributes for files inside sparse directories, causing inconsistencies in file handling.
To resolve this, revise 'git check-attr' to allow attribute reading for files in sparse directories from the corresponding .gitattributes files:
1. Utilize path_in_cone_mode_sparse_checkout() and index_name_pos_sparse to check if a path falls within a sparse directory.
2. If path is inside a sparse directory, employ the value of index_name_pos_sparse() to find the sparse directory containing path and path relative to sparse directory. Proceed to read attributes from the tree OID of the sparse directory using read_attr_from_blob().
3. If path is not inside a sparse directory, ensure that attributes are fetched from the index blob with read_blob_data_from_index().
Change the test 'check-attr with pathspec outside sparse definition' to 'test_expect_success' to reflect that the attributes inside a sparse directory can now be read.
Ensure that the sparse index case works correctly for git check-attr to illustrate the successful handling of attributes within sparse directories.
Another way: with Git 2.44 (Q1 2024), the builtin_objectmode attribute is populated for each path without adding anything in .gitattributes files, which would be useful in magic pathspec, e.g., ":(attr:builtin_objectmode=100755)" to limit to executables.
See commit 2232a88 (16 Nov 2023) by Joanna Wang (joannajw).
(Merged by Junio C Hamano -- gitster -- in commit 3e85584, 12 Jan 2024)
attr: add builtin objectmode values support
Signed-off-by: Joanna Wang
Gives all paths builtin objectmode values based on the paths' modes (one of 100644, 100755, 120000, 040000, 160000).
Users may use this feature to filter by file types.
For example a pathspec such as ':(attr:builtin_objectmode=160000)' could filter for submodules without needing to have builtin_objectmode=160000 to be set in .gitattributes for every submodule path.
These values are also reflected in git check-attr (man) results.
If the git_attr_direction is set to GIT_ATTR_INDEX or GIT_ATTR_CHECKIN and a path is not found in the index, the value will be unspecified.
This patch also reserves the builtin_* attribute namespace for objectmode and any future builtin attributes.
Any user defined attributes using this reserved namespace will result in a warning.
This is a breaking change for any existing builtin_* attributes.
Pathspecs with some builtin_* attribute name (excluding builtin_objectmode) will behave like any attribute where there are no user specified values.
gitattributes now includes in its man page:
RESERVED BUILTIN_* ATTRIBUTES
builtin_* is a reserved namespace for builtin attribute values. Any user defined attributes under this namespace will be ignored and trigger a warning.
builtin_objectmode
This attribute is for filtering files by their file bit modes (40000, 120000, 160000, 100755, 100644). e.g. ':(attr:builtin_objectmode=160000)'.
You may also check these values with git check-attr builtin_objectmode -- <file>.
If the object is not in the index git check-attr --cached will return unspecified.
So you can run git check-attr $check_opts builtin_objectmode -- "$path" (no .gitattributes needed).
Note that a value of 100755 only tells you that the file is executable, not that it is binary: while many executable files are binaries (e.g., compiled programs), some are text-based scripts (e.g., shell scripts, Python scripts).
blazee adds in the comments:
This has the same problem as the other answer (on Git 2.39 at least) - does not work without file being in .gitattributes.
I agree: git check-attr relies on .gitattributes to resolve attributes like binary or text. If a file is not matched by any pattern in .gitattributes, git check-attr will not return any meaningful information, and Git would need to fall back to its heuristic:
(From the 2009 question and 2017 answer "How do I distinguish between 'binary' and 'text' files?")
Files with null bytes (\0) in the first 8000 bytes (about 8 KB) are considered binary.
file_to_check="somefile"
if grep -qI . "$file_to_check"; then
    echo "$file_to_check is treated as text (heuristic)"
else
    echo "$file_to_check is treated as binary (heuristic)"
fi
You also have git diff as a practical fallback: binary files are reported with "Binary files differ", while text files will show inline diffs.
git diff HEAD -- file
But I would still suggest explicitly setting attributes for critical files to avoid ambiguity:
*.dat text
binary.dat binary
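The "attributes first, heuristic second" idea above could be sketched roughly as follows. The function names kind_from_text_attr and git_kind are made up for this illustration, only the text attribute is consulted (the binary macro unsets it), and the fallback reuses the grep -I check from above with empty files counted as text. Treat it as a sketch under those assumptions, not as what Git does internally.

```shell
# kind_from_text_attr VALUE
# Maps the value printed by `git check-attr text` to a verdict.
# (Hypothetical helper name, for illustration only.)
kind_from_text_attr() {
    case "$1" in
        set)   echo text ;;       # e.g. "*.dat text"
        unset) echo binary ;;     # e.g. "binary.dat binary" or "-text"
        *)     echo undecided ;;  # unspecified, auto, or any other value
    esac
}

# git_kind PATH
# Asks git check-attr first; if the attribute does not decide,
# falls back to the content heuristic (grep -I skips binary files).
git_kind() {
    value=$(git check-attr text -- "$1" | sed 's/^.*: text: //')
    kind=$(kind_from_text_attr "$value")
    if [ "$kind" = undecided ]; then
        # An empty file has no NUL byte, so count it as text.
        if [ ! -s "$1" ] || grep -qI . -- "$1"; then
            kind=text
        else
            kind=binary
        fi
    fi
    echo "$kind"
}
```

Inside a work tree, git_kind ascii.dat would then print one word that a front end can branch on.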
Upvotes: 3
Reputation: 566
git grep -I --name-only --untracked -e . -- ascii.dat binary.dat ...
will return the names of files that git interprets as text files.
The trick here is in these two git grep parameters:
-I : Don’t match the pattern in binary files.
-e . : Regular expression match any character in the file
You can use wildcards e.g.
git grep -I --name-only --untracked -e . -- *.ps1
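For a viewer that needs a yes/no answer for a single file, the same command can be wrapped so that only the exit status matters. This is just a sketch: is_text_for_git is a name invented here, and it has to run inside the repository. One caveat that follows from the pattern: a file that is empty or contains only newlines has nothing for "." to match, so it would be reported as "not text".

```shell
# is_text_for_git PATH
# Succeeds (exit 0) if git grep, with binary files excluded by -I,
# finds at least one character in PATH. -q suppresses the output.
# (Hypothetical helper name, for illustration only.)
is_text_for_git() {
    git grep -I -q --untracked -e . -- "$1"
}
```

Usage would look like: is_text_for_git ascii.dat && echo text || echo "binary (or empty)"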
Upvotes: 41
Reputation: 5927
Use git check-attr --all.
This works regardless of whether the file has been staged/committed or not.
Tested on git version 2.30.2.
Assuming you have this in .gitattributes:
package-lock.json binary
There is this output.
git check-attr --all package-lock.json
package-lock.json: binary: set
package-lock.json: diff: unset
package-lock.json: merge: unset
package-lock.json: text: unset
For normal files, there is no output.
git check-attr --all package.json
Upvotes: 2
Reputation: 181
# considered binary (or with bare CR) file
git ls-files --eol | grep -E '^(i/-text)'
# files that do not have any line-ending characters (including empty files) - unlikely that this is a true binary file ?
git ls-files --eol | grep -E '^(i/none)'
# via experimentation
# ------------------------
# "-text" binary (or with bare CR) file : not auto-normalized
# "none" text file without any EOL : not auto-normalized
# "lf" text file with LF : is auto-normalized when gitattributes text=auto
# "crlf" text file with CRLF : is auto-normalized when gitattributes text=auto
# "mixed" text file with mixed line endings : is auto-normalized when gitattributes text=auto
# (LF or CRLF, but not bare CR)
Source: https://git-scm.com/docs/git-ls-files#Documentation/git-ls-files.txt---eol https://github.com/git/git/commit/a7630bd4274a0dff7cff8b92de3d3f064e321359
Oh, by the way: be careful with setting the text attribute in .gitattributes, e.g. *.abc text. In that case all files matching *.abc will be normalized, even if they are binary (a CRLF found inside the binary data would be normalized to LF). This is different from the text=auto behaviour.
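To turn the i/... column into something a script can use, the output can be run through a small filter. The sketch below assumes the documented layout (info fields, then a TAB, then the path) and uses a made-up function name, eol_kind; it only looks at the index column and reports an empty i/ field as "unknown".

```shell
# eol_kind
# Reads `git ls-files --eol` lines on stdin and prints
# "<kind><TAB><path>", where kind is binary, text or unknown.
# (Hypothetical helper name, for illustration only.)
eol_kind() {
    awk -F '\t' '{
        split($1, info, " ")            # info[1] is "i/<eolinfo>"
        if (info[1] == "i/-text")
            kind = "binary"             # binary, or text with bare CR
        else if (info[1] == "i/")
            kind = "unknown"            # no index entry to inspect
        else
            kind = "text"               # none, lf, crlf or mixed
        print kind "\t" $2
    }'
}
```

It could be used as: git ls-files --eol -- ascii.dat binary.dat | eol_kind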
Upvotes: 16
Reputation: 99
@bonh gave a working answer in a comment
git diff --numstat 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- | grep "^-" | cut -f 3
It shows all files which git interprets as binaries.
Upvotes: 5
Reputation: 224591
builtin_diff() [1] calls diff_filespec_is_binary(), which calls buffer_is_binary(), which checks for any occurrence of a zero byte (NUL “character”) in the first 8000 bytes (or the entire length if shorter).
I do not see that this “is it binary?” test is explicitly exposed in any command though.
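If re-implementing the check is acceptable, the rule described above (any NUL byte within the first 8000 bytes) is easy to imitate outside of Git. The following is only an approximation with an invented name, looks_binary; it ignores .gitattributes entirely and assumes a head that supports -c.

```shell
# looks_binary PATH
# Succeeds (exit 0) if PATH has a NUL byte in its first 8000 bytes.
# Works by comparing the byte count before and after deleting NULs.
# (Hypothetical helper name, for illustration only.)
looks_binary() {
    total=$(head -c 8000 < "$1" | wc -c)
    without_nul=$(head -c 8000 < "$1" | tr -d '\000' | wc -c)
    [ $((total)) -ne $((without_nul)) ]
}
```

For example: looks_binary file-to-test && echo binary || echo "not binary". An empty file contains no NUL byte, so it counts as not binary here.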
git merge-file directly uses buffer_is_binary(), so you may be able to make use of it:
git merge-file /dev/null /dev/null file-to-test
It seems to produce an error message like "error: Cannot merge binary files: file-to-test" and yields an exit status of 255 when given a binary file. I am not sure I would want to rely on this behavior though.
Maybe git diff --numstat would be more reliable:
isBinary() {
    p=$(printf '%s\t-\t' -)
    t=$(git diff --no-index --numstat /dev/null "$1")
    case "$t" in "$p"*) return 0 ;; esac
    return 1
}
isBinary file-to-test && echo binary || echo not binary
For binary files, the --numstat output should start with - TAB - TAB, so we just test for that.
[1] builtin_diff() has strings like "Binary files %s and %s differ" that should be familiar.
Upvotes: 57
Reputation: 11
At the risk of getting slapped for poor code quality, I'm listing a C utility, is_binary, built around the original buffer_is_binary() routine in the Git source. Please see the internal comments for how to build and run it. Easily modifiable:
/***********************************************************
* is_binary.c
*
* Usage: is_binary <pathname>
* Returns a 1 if a binary; return a 0 if non-binary
*
* Thanks to Git and Stackoverflow developers for helping with these routines:
* - the buffer_is_binary() routine from the xdiff-interface.c module
* in git source code.
* - the read-a-filename-from-stdin route
* - the read-a-file-into-memory (fill_buffer()) routine
*
* To build:
* % gcc is_binary.c -o is_binary
*
* To build debuggable (to push a few messages to stdout):
* % gcc -DDEBUG=1 ./is_binary.c -o is_binary
*
* BUGS:
*   Doesn't work with piped input, like
*     % cat foo.tar | is_binary
*   (the size is found with fseek()/ftell(), which needs a seekable file).
*   An empty input is reported as non-binary: it contains no NUL byte.
*
* Revision 1.4
*
* Tue Sep 12 09:01:33 EDT 2017
***********************************************************/
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define FIRST_FEW_BYTES 8000

/* global, unfortunately */
char *source_blob_buffer;

/* From: https://stackoverflow.com/questions/14002954/c-programming-how-to-read-the-whole-file-contents-into-a-buffer */
/* From: https://stackoverflow.com/questions/1563882/reading-a-file-name-from-piped-command */
/* From: https://stackoverflow.com/questions/6119956/how-to-determine-if-git-handles-a-file-as-binary-or-as-text */

/* The key routine in this function is from libc: void *memchr(const void *s, int c, size_t n); */
/* Checks for any occurrence of a zero byte (NUL character) in the first 8000 bytes (or the entire length if shorter). */
int buffer_is_binary(const char *ptr, unsigned long size)
{
    if (FIRST_FEW_BYTES < size)
        size = FIRST_FEW_BYTES;
    return !!memchr(ptr, 0, size);
}

/* Reads the whole file into source_blob_buffer.
 * Returns the number of bytes read, or -1 on error. */
long fill_buffer(FILE *file_object_pointer)
{
    if (fseek(file_object_pointer, 0, SEEK_END) != 0)
        return -1;                              /* not seekable, e.g. a pipe */
    long fsize = ftell(file_object_pointer);
    if (fsize < 0)
        return -1;
    fseek(file_object_pointer, 0, SEEK_SET);    /* same as rewind(f); */

    source_blob_buffer = malloc(fsize + 1);
    if (source_blob_buffer == NULL)
        return -1;
    size_t got = fread(source_blob_buffer, 1, fsize, file_object_pointer);
    fclose(file_object_pointer);
    source_blob_buffer[got] = 0;                /* terminator, not part of the data */
    return (long)got;
}

int main(int argc, char *argv[])
{
    const char *pathname = "(stdin)";
    FILE *file_object_pointer;

    if (argc == 1) {
        file_object_pointer = stdin;
    } else {
        pathname = argv[1];                     /* no copy, so no length limit */
#ifdef DEBUG
        printf("pathname=%s.\n", pathname);
#endif
        file_object_pointer = fopen(pathname, "rb");
        if (file_object_pointer == NULL) {
            printf("I'm sorry, Dave, I can't do that--");
            printf("open the file '%s', that is.\n", pathname);
            exit(3);
        }
    }

    long fsize = fill_buffer(file_object_pointer);
    if (fsize < 0) {
        printf("Not a seekable file--sorry.\n");
        exit(4);
    }

    /* Only the bytes actually read are examined, not the added terminator. */
    int result = buffer_is_binary(source_blob_buffer, (unsigned long)fsize);
#ifdef DEBUG
    if (result == 1) {
        printf("%s %ld\n", pathname, fsize);
    } else {
        printf("File '%s' is NON-BINARY; size is %ld bytes.\n", pathname, fsize);
    }
#endif
    exit(result);
    /* easy check -- 'echo $?' after running */
}
Upvotes: 1
Reputation: 31441
I don't like this answer, but you can parse the output of git-diff-tree to see if it is binary. For example:
git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- MegaCli
diff --git a/megaraid/MegaCli b/megaraid/MegaCli
new file mode 100755
index 0000000..7f0e997
Binary files /dev/null and b/megaraid/MegaCli differ
as opposed to:
git diff-tree -p 4b825dc642cb6eb9a060e54bf8d69288fbee4904 HEAD -- megamgr
diff --git a/megaraid/megamgr b/megaraid/megamgr
new file mode 100755
index 0000000..50fd8a1
--- /dev/null
+++ b/megaraid/megamgr
@@ -0,0 +1,78 @@
+#!/bin/sh
[…]
Oh, and BTW, 4b825d… is a magic SHA which represents the empty tree (it is the SHA for an empty tree, but git is specially aware of this magic).
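Rather than hard-coding the empty-tree ID, it can be computed, which should also keep working in repositories that use a different hash algorithm. The sketch below uses two invented helper names (empty_tree_id and binary_paths_at) and, instead of parsing the patch text, relies on the "-" TAB "-" prefix that --numstat prints for binary files.

```shell
# empty_tree_id
# Prints the object ID of the empty tree by hashing empty input as a tree.
# (Hypothetical helper name, for illustration only.)
empty_tree_id() {
    git hash-object -t tree /dev/null
}

# binary_paths_at REV
# Lists the paths in REV that git diff reports as binary when REV is
# compared against the empty tree ("-<TAB>-<TAB>path" in --numstat).
binary_paths_at() {
    git diff --numstat "$(empty_tree_id)" "$1" -- | grep '^-' | cut -f 3
}
```

For example, binary_paths_at HEAD would print one binary path per line.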
Upvotes: 22
Reputation: 28435
You can use the command-line 'file' utility. On Windows it is included in the Git installation and is normally located in the C:\Program Files\git\usr\bin folder.
file --mime-encoding *
See more in Get encoding of a file in Windows
Upvotes: -12