nneonneo

Reputation: 179687

How to list all text (non-binary) files in a git repository?

I have a repository with a lot of autogenerated source files I've marked as "binary" in .gitattributes (they are checked in because not everyone has access to the generator tools). Additionally, the repo has a lot of source-ish files in ignored directories (again, generated as part of the build processes), and a number of actual binary files (e.g. little resource files like icons).

I'd now like to find all the non-auto-generated and non-ignored files in the repo. I thought I'd just do this with find and a bunch of exclusions, but now I have a horrendous find statement with a dozen clauses (and it still doesn't perfectly do the job). git ls-files works but shows me all the binary files without differentiation, which I have to filter out.

So, I'm wondering: is there a simple command I can run which lists every file checked into the repo, and which git considers a "text" file?

Upvotes: 31

Views: 5330

Answers (5)

CervEd

Reputation: 4272

You can use Git's eol information to find non-binary files.

git ls-files --eol | grep '^i/lf' | cut -f 2-

This lists all files that are checked in with LF line endings.

This has the advantage of using the git ls-files command, so it can easily be piped to xargs. It's also a plumbing command, so it might be faster (I haven't benchmarked).

This may be a viable alternative to using the git grep method as it appears to be more customizable in terms of what one considers binary and not.

Note that you can specify which files Git should consider binary in .gitattributes, for example by adding *.svg binary. The git grep method respects this. The eol information also respects it, but not for files that were already checked into the index before the attribute was set. You can always append | grep -v 'attr/-text' to exclude files that have been marked as binary in .gitattributes.
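
As a rough sketch of the filtering described above, the helper below reads git ls-files --eol output on stdin and prints only the paths that look like text. The function name eol_text_paths is made up for illustration. It assumes the layout described in the git ls-files documentation: the i/, w/ and attr/ columns separated by spaces, then a tab, then the path. Unlike the grep '^i/lf' version, it also keeps files reported as crlf, mixed or none, which may or may not be what you want.

```shell
# Hypothetical helper: filter `git ls-files --eol` output read from stdin.
eol_text_paths() {
  awk -F '\t' '
    # $1 holds the i/, w/ and attr/ columns; the path follows the first tab.
    # Keep entries whose index content is text and whose attribute is not -text.
    $1 ~ /^i\/(lf|crlf|mixed|none) / && $1 !~ / attr\/-text *$/ {
      print substr($0, index($0, "\t") + 1)
    }'
}

# Usage, from inside a repository:
# git ls-files --eol | eol_text_paths
```

If the result is piped to xargs, paths containing spaces need extra care, for example by converting newlines to NUL bytes with tr '\n' '\0' and using xargs -0.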

Upvotes: 3

juanmirocks

Reputation: 6239

Using git ls-files and awk:

git ls-files --eol | awk -F '\t' '{if ($0 !~ /^i\/-text/) print $2}'

Note: this solution also returns empty (non-binary) files.

Explanation:

  • --eol: show <eolinfo> and <eolattr> of files. Ref.: https://git-scm.com/docs/git-ls-files#Documentation/git-ls-files.txt---eol
  • awk -F '\t': split the piped input lines on tabs. At least with git version 2.37.2, the output of git ls-files --eol has 4 human-readable columns, but only the 4th (last) one is preceded by a tab. So when splitting on tabs, awk sees two columns.
  • (awk) if ($0 !~ /^i\/-text/): only match if the line does not start with i/-text. This is our test that the file is NOT a binary file.
  • (awk) print $2: print the 2nd column, which is the file's path (as requested by the OP). Note that this solution also works for filenames containing spaces.
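
To make the tab-splitting point concrete, here is a small sketch that feeds awk a hand-written line imitating the --eol layout. The sample line and the helper name show_tab_fields are assumptions for illustration, not real output from a repository.

```shell
# Hypothetical helper: report how awk splits each stdin line on tabs.
show_tab_fields() {
  awk -F '\t' '{ printf "fields=%d path=[%s]\n", NF, $2 }'
}

# Sample line: spaces between the info columns, a single tab before the path.
printf 'i/lf    w/lf    attr/                 \tdocs/read me.txt\n' | show_tab_fields
# prints: fields=2 path=[docs/read me.txt]
```

Because only the tab counts as a separator, the space inside the path stays part of the second field.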

Acknowledgment: My answer expands on @CervEd's answer (https://stackoverflow.com/a/67346778/341320) and also takes as a reference another answer from @Quential33 (https://stackoverflow.com/a/66796286/341320)

Upvotes: 0

git grep --cached -Il ''

lists all non-empty regular (no symlinks) text files:

  • -I: don't match the pattern in binary files
  • -l: only show the matching file names, not matching lines
  • '': the empty string makes git grep match any non-empty file
  • --cached: also find files added with git add but not yet committed (optional)

Or you could use the approach from "How to determine if Git handles a file as binary or as text?" in a for loop over git ls-files.

TODO empty files.
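
One possible way to cover the empty-file case, as a sketch rather than a tested recipe: combine the git grep output with the tracked regular files whose blob is the empty blob. The function name list_text_files_including_empty is hypothetical. The sketch assumes git ls-files -s prints "<mode> <object> <stage>", a tab, then the path, and that git hash-object on /dev/null yields the empty blob's ID for the current repository.

```shell
# Hypothetical helper: text files per `git grep -I`, plus empty regular files.
list_text_files_including_empty() {
  # ID of the empty blob in this repository's hash format.
  empty_blob=$(git hash-object -t blob /dev/null) || return
  {
    # Non-empty text files (binary files are skipped by -I).
    git grep --cached -Il ''
    # Regular files (mode 100644 or 100755) whose content is the empty blob.
    git ls-files -s | awk -F '\t' -v h="$empty_blob" '
      { split($1, meta, " ") }
      meta[1] ~ /^100/ && meta[2] == h { print $2 }'
  } | sort -u
}

# Usage, from inside a repository:
# list_text_files_including_empty
```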

Find all binary files instead: Find all binary files in git HEAD

Tested on Git 2.16.1 with this test repo.

Upvotes: 39

Cacovsky

Reputation: 2546

A clever hack to achieve this: listing all non-binary files that contain carriage returns

$ git grep --cached -I -l -e $'\r'

For my case, an empty string works better:

$ git grep --cached -I -l -e $''

Taken from "git list binary and/or non-binary files?".

Upvotes: 4

VonC

Reputation: 1328282

The standard method for listing non-ignored files is:

git ls-files --exclude-standard --cached

But, as you have seen, it lists all versioned files.

One workaround could be to define in a separate file "exclude_binaries" an exclusion pattern in order to match all binaries that you know of.

git ls-files --exclude-standard --cached \
--exclude-from=/path/to/exclude_binaries

That would be a less complex find, but it doesn't provide a fully automated way to list non-binary files: you still have to identify and list them in a separate pattern file.

Upvotes: 1
