bairog

Reputation: 3373

Git pre-commit hook that detects non-UTF-8 encodings among modified/added files (and rejects the commit if it finds any)

I'm using Git for Windows (and TortoiseGit).

My goal is to reject any commit in which at least one modified or added file is not UTF-8 encoded.

So any help is appreciated.

Upvotes: 4

Views: 2492

Answers (2)

bairog

Reputation: 3373

So the answer is (thanks to phd and great thanks to torek for their useful notes):

    git diff --name-only --staged --diff-filter=d | xargs -I {} bash -c \
      "iconv -f utf-8 -t utf-16 '{}' &>/dev/null || { echo '{}' - is non-UTF8!; exit 1; }"

This command iterates over every file changed in the commit except deleted ones (that is, added, modified, copied, and renamed files) and checks whether each one is valid UTF-8. Every offending file is listed and the commit is aborted.

Upvotes: 5

torek

Reputation: 488183

Your existing solution is probably sufficient. It's not 100% correct though: here are the remaining issues, all of which are minor ones that you can fix later (if ever) at your leisure:

  • You need only the git diff ... --staged (or --cached), as what Git will commit is whatever files are in the index/staging-area, and git diff compares that with what's in the HEAD commit and tells you what's different there. If a copy of a file in the index differs from the copy of the file in HEAD, you should examine the index copy.

  • Technically it would be better to use git diff-index --cached here so as to not obey any of the user's git diff configuration. That is, git diff-index is a plumbing command in Git, which means it's aimed at being used from other computer programs: it runs in a completely predictable manner based on arguments only, not on any git config settings. But if you're doing this for yourself, and you configure git diff such that it breaks your own use of git diff, well, that's your own fault. :-)

  • You might also consider using a --diff-filter to exclude deleted files here. Otherwise your checker will always fail on deletion (as iconv won't be able to read the deleted file).

  • Most significant: iconv will be reading the file from the work-tree. As I noted in the first bullet point, Git is going to commit what's staged, not what's in the work-tree.

As an example—which may or may not be possible from within TortoiseGit—consider what happens if you do this:

    $ git checkout master
    $ printf '\300\300\300' > badfile    # put bad non-UTF-8 crud into file
    $ git add badfile                    # copy file into index
    $ echo 'good data' > badfile         # replace work-tree contents
    $ git commit

This commit is going to record the bad contents that are in the index (three \300 bytes with no newline), but your pre-commit hook is going to run iconv -f utf-8 -t utf-16 over the work-tree copy of the file, which now contains good data and therefore passes the check.

To fix this, your pre-commit filter must extract the data from the index for each file that is to be committed. How you go about doing that is up to you. The simplest (but perhaps slowest) method is to extract the entire index contents to a temporary work area using git checkout-index. A better method might be to turn each in-index (in-staging-area) path name into a valid index specifier (that is, path/to/file becomes :path/to/file) and use git cat-file -p $specifier | iconv ... to scan each one. But all of these will be fairly inefficient, especially on Windows. For efficiency, you might want to write a Python script that uses git cat-file --batch to extract them all in one pass, and do the format-checking there.
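
For what it's worth, here is a minimal sketch of the git cat-file -p variant, combined with git diff-index --cached and the --diff-filter from the bullets above. It is untested on Windows and makes a few assumptions: the function names is_utf8 and check_staged are mine (not part of Git), the script is run by bash (it relies on read -d ''), and it is installed as .git/hooks/pre-commit. It is the simple one-process-per-file approach, not the efficient --batch one.

```shell
#!/bin/bash
# Sketch of a pre-commit hook that checks the staged (index) copy of each
# file, not the work-tree copy. is_utf8 and check_staged are illustrative
# names, not Git commands.

# Succeeds if the data on stdin is valid UTF-8 (same iconv test as in the
# question's one-liner, but reading stdin instead of a work-tree file).
is_utf8() {
    iconv -f utf-8 -t utf-16 >/dev/null 2>&1
}

# Lists staged paths, excluding deletions, and checks each index copy.
check_staged() {
    local against bad
    if git rev-parse --verify HEAD >/dev/null 2>&1; then
        against=HEAD
    else
        # First commit: compare against the empty tree instead of HEAD.
        against=$(git hash-object -t tree /dev/null)
    fi
    # -z emits NUL-terminated, unquoted path names; ":path" names the
    # index copy of that path. Names containing newlines would still be
    # reported confusingly, since the list below is newline-separated.
    bad=$(
        git diff-index --cached --name-only --diff-filter=d -z "$against" |
        while IFS= read -r -d '' path; do
            git cat-file -p ":$path" | is_utf8 || printf '%s\n' "$path"
        done
    )
    if [ -n "$bad" ]; then
        echo "Staged files that are not valid UTF-8:" >&2
        printf '%s\n' "$bad" >&2
        return 1
    fi
}

# Assumption: the script is installed under the name "pre-commit".
case "$0" in
    *pre-commit) check_staged || exit 1 ;;
esac
```

With this in place, the badfile example above should be rejected, because the three \300 bytes are read from the index even though the work-tree copy is fine. Note that, like the original one-liner, this will also flag most binary files (images and so on), so you may want to skip those by extension or attribute.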

Upvotes: 2
