Reputation: 3105
I'm serving bare git repos from my Raspberry Pi. My goal is to run git fsck --full nightly to detect file system issues early. I expect fsck to check both "object directories" and "objects", and to produce output such as:
pi@raspi2:/media/usb/git/dw.git $ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (14538/14538), done.
For one of my repos, no objects are checked:
pi@raspi2:/media/usb/git/ts-ch.git.borken $ git --version
git version 2.11.0
pi@raspi2:/media/usb/git/ts-ch.git.borken $ git fsck --full
Checking object directories: 100% (256/256), done.
pi@raspi2:/media/usb/git/ts-ch.git.borken $
I modified one file under objects/ (a 322 kB .pdf file) with the commands below and ran fsck again. It showed the same message as before, with no errors.
cd objects/86/
chmod u+w f3e6e674431ab3006cbb56fddecbdb4a7724b4
echo "foosel" >> f3e6e674431ab3006cbb56fddecbdb4a7724b4
chmod u-w f3e6e674431ab3006cbb56fddecbdb4a7724b4
All repos are set up the same way: they are bare and have no special config:
pi@raspi2:/media/usb/git/ts-ch.git $ git config --list
core.repositoryformatversion=0
core.filemode=true
core.bare=true
Am I missing something? Why is this modified object not detected? Its SHA1 should certainly not match anymore. Thanks for any hints!
Upvotes: 7
Views: 2984
Reputation: 47032
Yes, you are missing something. Namely, you didn't corrupt the file in a way that Git pays attention to. Objects stored on disk generally start with the object type, followed by a space, followed by the size (in ASCII decimal digits), followed by a NUL byte. The size states how big the object is, and that's all that Git ends up reading. So tacking data onto the end like that won't actually corrupt the object. If you replaced the contents of the file with something else, then you'd see the issue.
For reference, the object format details are in the Git User's Manual:
Object storage format
All objects have a statically determined "type" which identifies the format of the object (i.e. how it is used, and how it can refer to other objects). There are currently four different object types: "blob", "tree", "commit", and "tag".
Regardless of object type, all objects share the following characteristics: they are all deflated with zlib, and have a header that not only specifies their type, but also provides size information about the data in the object. It's worth noting that the SHA-1 hash that is used to name the object is the hash of the original data plus this header, so sha1sum file does not match the object name for file.
As a result, the general consistency of an object can always be tested independently of the contents or the type of the object: all objects can be validated by verifying that (a) their hashes match the content of the file and (b) the object successfully inflates to a stream of bytes that forms a sequence of <ascii type without space> + <space> + <ascii decimal size> + <byte\0> + <binary object data>.
The structured objects can further have their structure and connectivity to other objects verified. This is generally done with the git fsck program, which generates a full dependency graph of all objects, and verifies their internal consistency (in addition to just verifying their superficial consistency through the hash).
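To make the format above concrete, here is a minimal Python sketch of the two checks the manual describes (the hash matches, and the file inflates to type + space + size + NUL + data), plus a check for bytes left over after the zlib stream ends. This is not Git's code; the function names make_loose_object and check_loose_object are made up for illustration, and it only assumes the loose-object layout quoted above.

```python
import hashlib
import zlib


def make_loose_object(obj_type, data):
    """Build a loose object; return (object name, bytes as stored on disk)."""
    store = obj_type.encode() + b" " + str(len(data)).encode() + b"\0" + data
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def check_loose_object(name, raw):
    """Return a list of problems found in one loose object file (empty if none)."""
    problems = []
    inflater = zlib.decompressobj()
    try:
        store = inflater.decompress(raw)
    except zlib.error as exc:
        return ["cannot inflate: %s" % exc]
    if not inflater.eof:
        problems.append("zlib stream is truncated")
    if inflater.unused_data:
        # Bytes after the end of the zlib stream, e.g. from `echo foosel >> file`.
        problems.append("garbage at end of loose object")
    header, sep, body = store.partition(b"\0")
    _obj_type, _, size = header.partition(b" ")
    if not sep or not size.isdigit():
        problems.append("malformed header")
    elif int(size) != len(body):
        problems.append("size in header does not match data length")
    if hashlib.sha1(store).hexdigest() != name:
        problems.append("hash mismatch")
    return problems


if __name__ == "__main__":
    name, raw = make_loose_object("blob", b"hello\n")
    print(check_loose_object(name, raw))               # []
    print(check_loose_object(name, raw + b"foosel\n"))  # ['garbage at end of loose object']
```

Note that the appended bytes change neither the inflated content nor its SHA-1, which is why a check based only on the hash and the header does not notice them.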
However, there is an interesting interaction that leads me to think that git fsck should be working harder and noticing when the file has garbage at the end. If you attempt to run git gc on that repo, you'll end up seeing an error like this:
:: git gc
Counting objects: 9, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (3/3), done.
error: garbage at end of loose object '45b983be36b73c0788dc9cbcb76cbb80fc7bb057'
fatal: loose object 45b983be36b73c0788dc9cbcb76cbb80fc7bb057 (stored in .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057) is corrupt
error: failed to run repack
It seems like if git gc can't actually run, then git fsck should be catching the issue.
The missing "Checking objects" line has a really simple explanation: there are no packed objects to check. Those live in .git/objects/pack (objects/pack in a bare repo). If you don't have any of those files, then you won't see the "Checking objects" bit.
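If you want to confirm this for a given repo, a small Python sketch like the following lists the pack files and counts the loose objects. It assumes the standard layout (256 two-hex-digit fan-out directories plus objects/pack); the helper names find_packs and count_loose_objects are hypothetical, and for a bare repo the git_dir argument is the repo directory itself.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def find_packs(git_dir):
    """Return the *.pack files under <git_dir>/objects/pack, sorted by name."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    if not pack_dir.is_dir():
        return []
    return sorted(pack_dir.glob("*.pack"))


def count_loose_objects(git_dir):
    """Count files in the two-hex-digit fan-out directories under objects/."""
    objects = Path(git_dir) / "objects"
    if not objects.is_dir():
        return 0
    count = 0
    for sub in objects.iterdir():
        # Skip objects/pack, objects/info and anything else that is not fan-out.
        if sub.is_dir() and len(sub.name) == 2 and set(sub.name) <= HEX_DIGITS:
            count += sum(1 for f in sub.iterdir() if f.is_file())
    return count


if __name__ == "__main__":
    import sys

    repo = sys.argv[1] if len(sys.argv) > 1 else "."
    print("pack files:   ", len(find_packs(repo)))
    print("loose objects:", count_loose_objects(repo))
```

A repo that reports zero pack files and only loose objects would match the situation described in the question.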
Upvotes: 4
Reputation: 1324268
I still don't understand why git refuses to report it is checking objects in this repo.
I'm going to bring it up on the git list because I think git fsck should check things thoroughly enough that all operations should work.
That might be related to the two sets of patches coming with Git 2.12 (Q1 2017): compiling Git 2.12 on your Raspberry Pi might yield better results.
See below: Git 2.20 (Q4 2018) is recommended.
"git fsck" inspects loose objects more carefully now.
See commit cce044d, commit c68b489, commit f6371f9, commit 118e6ce, commit 771e7d5, commit 0b20f1a (13 Jan 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 42ace93, 31 Jan 2017)
And:
"git fsck --connectivity-check" was not working at all.
See commit a2b2285, commit 97ca7ca (26 Jan 2017), commit c20d4d7 (24 Jan 2017), commit c2d17b3, commit c3271a0, commit c6c7b16, commit b4584e4, commit 1ada11e (16 Jan 2017), and commit 3e3f8bd (17 Jan 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 4ba6197, 31 Jan 2017)
Update Nov. 2018: the Git 2.12 recommended above actually introduced a regression, which made "git fsck" fall into an infinite loop while processing truncated loose objects.
See commit 98f425b, commit ccdc481, commit 5632baf (30 Oct 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 879a8d4, 13 Nov 2018)
check_stream_sha1(): handle input underflow
This commit fixes an infinite loop when fscking large truncated loose objects.
The check_stream_sha1() function takes an mmap'd loose object buffer and streams 4k of output at a time, checking its sha1.
The loop quits when we've output enough bytes (we know the size from the object header), or when zlib tells us anything except Z_OK or Z_BUF_ERROR.
The latter is expected because zlib may run out of room in our 4k buffer, and that is how it tells us to process the output and loop again.
But Z_BUF_ERROR also covers another case: one in which zlib cannot make forward progress because it needs more input.
This should never happen in this loop, because though we're streaming the output, we have the entire deflated input available in the mmap'd buffer. But since we don't check this case, we'll just loop infinitely if we do see a truncated object, thinking that zlib is asking for more output space.
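The shape of that loop, and of the missing check, can be illustrated with Python's zlib module. This is only a sketch of the idea, not a translation of Git's C code: the name check_stream_sha1_sketch is made up, Python's decompress objects do not expose Z_OK or Z_BUF_ERROR, and the sketch hashes the whole inflated stream instead of stopping at the size from the object header.

```python
import hashlib
import zlib

CHUNK = 4096  # output buffer size, matching the 4k mentioned in the commit message


def check_stream_sha1_sketch(deflated, expected_hex):
    """Inflate `deflated` CHUNK bytes at a time and compare the SHA-1 of the output."""
    inflater = zlib.decompressobj()
    digest = hashlib.sha1()
    pending = deflated
    while True:
        try:
            out = inflater.decompress(pending, CHUNK)
        except zlib.error:
            return False  # corrupt deflate data
        digest.update(out)
        if inflater.eof:
            break  # reached the end of the zlib stream
        pending = inflater.unconsumed_tail
        if not out and not pending:
            # No output produced and no input left, but the stream has not
            # ended: the input is truncated. Without this check the loop
            # would spin forever, which is the bug the commit describes.
            return False
    return digest.hexdigest() == expected_hex


if __name__ == "__main__":
    store = b"blob 20000\0" + b"a" * 20000
    name = hashlib.sha1(store).hexdigest()
    raw = zlib.compress(store)
    print(check_stream_sha1_sketch(raw, name))                   # True
    print(check_stream_sha1_sketch(raw[: len(raw) // 2], name))  # False
```

The key point is the explicit "no progress and no input left" test: a full output buffer and an exhausted input both look like "call me again" unless the loop distinguishes them.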
Git 2.22 (Q2 2019) improves "git fsck --connectivity-only", which did omit computation necessary to sift the objects that are not reachable from any of the refs into unreachable and dangling.
This is now enabled when dangling objects are requested (which is done by default, but can be overridden with the "--no-dangling" option).
See commit 8d8c2a5, commit df805ed (05 Mar 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit ea32776, 20 Mar 2019)
fsck: always compute USED flags for unreachable objects
The --connectivity-only option avoids opening every object, and instead just marks reachable objects with a flag and compares this to the set of all objects. This strategy is discussed in more detail in 3e3f8bd (fsck: prepare dummy objects for --connectivity-check, 2017-01-17).
This means that we report every unreachable object as dangling.
Whereas in a full fsck, we'd have actually opened and parsed each of those unreachable objects, marking their child objects with the USED flag, to mean "this was mentioned by another object".
And thus we can report only the tip of an unreachable segment of the object graph as dangling.
You can see this difference with a trivial example:
tree=$(git hash-object -t tree -w /dev/null)
one=$(echo one | git commit-tree $tree)
two=$(echo two | git commit-tree -p $one $tree)
Running git fsck will report only $two as dangling, but with --connectivity-only, both commits (and the tree) are reported. Likewise, using --lost-found would write all three objects.
We can make --connectivity-only work like the normal case by taking a separate pass over the unreachable objects, parsing them and marking objects they refer to as USED. That still avoids parsing any blobs, though we do pay the cost to access any unreachable commits and trees (which may or may not be noticeable, depending on how many you have).
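The difference between the two strategies can be modelled on a toy object graph. The Python sketch below is an illustration only: objects are plain strings, the children mapping stands in for parsing commits and trees, and the function names reachable_from and dangling are hypothetical rather than anything from Git's source.

```python
def reachable_from(tips, children):
    """Return every object reachable from `tips` by following `children` links."""
    seen = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(children.get(obj, ()))
    return seen


def dangling(all_objects, ref_tips, children, mark_used):
    """Return the objects that would be reported as dangling.

    mark_used=False models the old --connectivity-only behaviour: every
    unreachable object is reported. mark_used=True adds the extra pass that
    flags objects referred to by other unreachable objects as USED.
    """
    unreachable = set(all_objects) - reachable_from(ref_tips, children)
    if not mark_used:
        return unreachable
    used = set()
    for obj in unreachable:
        used.update(children.get(obj, ()))
    return unreachable - used


if __name__ == "__main__":
    # Same shape as the shell example: commit "two" has parent "one",
    # both commits point at "tree", and no ref points at any of them.
    children = {"two": ["one", "tree"], "one": ["tree"], "tree": []}
    objects = ["one", "two", "tree"]
    print(sorted(dangling(objects, [], children, mark_used=False)))  # ['one', 'tree', 'two']
    print(sorted(dangling(objects, [], children, mark_used=True)))   # ['two']
```

Only the unreachable objects are visited in the extra pass, which mirrors the commit's point that the cost depends on how many unreachable commits and trees the repository holds.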
Upvotes: 0