Can I read a tree directly into a working directory, bypassing the index?

Question

I'm exploring Git internals, and I'm wondering if there's a Git command that lets me read a tree into a working tree directly, without using the index. For example, I've created a tree:

$ echo 'f1 content' | git hash-object -w --stdin
a1deaae8f9ac984a5bfd0e8eecfbafaf4a90a3d0

$ echo 'f2 content' | git hash-object -w --stdin
9b96e21cb748285ebec53daec4afb2bdcb9a360a

$ printf '%s %s %s\t%s\n' \
> 100644 blob a1deaae8f9ac984a5bfd0e8eecfbafaf4a90a3d0 f1.txt \
> 100644 blob 9b96e21cb748285ebec53daec4afb2bdcb9a360a f2.txt |
> git mktree
e05d9daa03229f7a7f6456d3d091d0e685e6a9db

And now I want to read the tree e05d9daa03229f7a7f6456d3d091d0e685e6a9db with two files f1.txt and f2.txt directly into a working directory. I know I can use the following combo:

$ git read-tree e05d9daa03229f7a7f6456d3d091d0e685e6a9db
$ git checkout-index -a

But I'm wondering if there's a single command to do that.

torek · Accepted Answer

The short answer is "no": all Git operations that read a complete tree do so into an index.

The phrase an index, as opposed to the index, is your main escape hatch that makes the long answer a qualified "yes". You can avoid using the index by using an index—as in, some alternative index instead of "the" index. You make this other index take the place of "the" index by putting the alternative index's path-name in the environment variable GIT_INDEX_FILE. And in some cases, you can bypass the index entirely, by ... well, read on. :-)
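As a minimal sketch of that escape hatch, the snippet below drives Git from Python through subprocess. It assumes git is on PATH; the helper names git and extract_tree, the throwaway repository, and the alt-index file name are all illustrative choices, not anything Git defines.

```python
import os
import subprocess
import tempfile


def git(*args, cwd, env=None, stdin=None):
    """Run a git command in `cwd` and return its stdout, stripped."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, env=env, input=stdin,
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def extract_tree(repo, tree_hash, index_path):
    """Populate the work-tree from `tree_hash` via a private index file."""
    env = dict(os.environ, GIT_INDEX_FILE=index_path)
    git("read-tree", tree_hash, cwd=repo, env=env)   # tree -> alternative index
    git("checkout-index", "-a", cwd=repo, env=env)   # alternative index -> work-tree


# Throwaway repository holding the two blobs and the tree from the question.
repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
f1 = git("hash-object", "-w", "--stdin", cwd=repo, stdin="f1 content\n")
f2 = git("hash-object", "-w", "--stdin", cwd=repo, stdin="f2 content\n")
tree = git("mktree", cwd=repo,
           stdin=f"100644 blob {f1}\tf1.txt\n100644 blob {f2}\tf2.txt\n")

# The index path must not exist yet: git creates it on the first write.
alt_index = os.path.join(tempfile.mkdtemp(), "alt-index")
extract_tree(repo, tree, alt_index)

print(sorted(name for name in os.listdir(repo) if name != ".git"))
```

In this sketch the files land in the work-tree while the repository's own .git/index is never written; only the alternative index file is.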

There are, I think, two main reasons that Git "wants" to read a series of trees from a commit into an index, before copying files to a work-tree. The first one has to do with resolving full path names: within a tree object, inside a Git repository, each stored sub-object—sub-tree or blob—has a mode (which is 40000 for a sub-tree),¹ a hash, and a name. The name is not the full path name of the object, though: it's just the name component, the bar part of foo/bar/baz.txt for instance.

By walking each tree entry by entry, recursing on each sub-tree, Git can build up an index in which each name stored in the index is a full path name. That is, we kick off the tree extraction with, in pseudo-code:

build_index('', top_level_tree_hash)

where build_index does this (in pseudo-Python):

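def build_index(path_so_far, tree_hash):
    # path_so_far is '' at the top level, otherwise it ends in '/'
    tree = get_object('tree', tree_hash)
    for mode, hash, name in tree:
        if mode == MODE_TREE:
            # sub-tree: recurse, extending the path prefix
            build_index(path_so_far + name + '/', hash)
        else:
            # leaf entry: record it under its full path name
            cache_this_object(path_so_far + name, mode, hash)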

When the recursion finishes, the cache aspect of the index has in it all of the full path names, modes, and hashes for each non-tree object, and is ready to be extracted.
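To make the recursion concrete, here is a runnable toy version. It is only a sketch: the in-memory OBJECTS dictionary stands in for the object database, the "hashes" are made-up placeholder strings rather than real object IDs, and the index is a plain dictionary instead of Git's actual cache format.

```python
MODE_TREE = "40000"

# Toy object store: tree "hash" -> list of (mode, hash, name) entries.
# All hashes here are placeholders, not real Git object IDs.
OBJECTS = {
    "tree-root": [
        ("100644", "blob-readme", "README"),
        (MODE_TREE, "tree-foo", "foo"),
    ],
    "tree-foo": [(MODE_TREE, "tree-bar", "bar")],
    "tree-bar": [("100755", "blob-baz", "baz.txt")],
}


def build_index(path_so_far, tree_hash, index):
    """Recursively flatten a tree into {full_path: (mode, hash)}."""
    for mode, entry_hash, name in OBJECTS[tree_hash]:
        if mode == MODE_TREE:
            # Sub-tree: recurse with the path prefix extended.
            build_index(path_so_far + name + "/", entry_hash, index)
        else:
            # Leaf: only now do we know the entry's full path name.
            index[path_so_far + name] = (mode, entry_hash)
    return index


index = build_index("", "tree-root", {})
for path, (mode, entry_hash) in sorted(index.items()):
    print(mode, entry_hash, path)
# prints:
# 100644 blob-readme README
# 100755 blob-baz foo/bar/baz.txt
```

Note that the tree-bar entry stores only the name baz.txt; the foo/bar/ prefix exists solely because the recursion carried it down.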

Without the index, though, if you just have a tree to read, you have no idea what the leading path-name components up to this point should be. We need the recursion above to maintain our path-names for us.

The second reason Git "wants" to read into an index has to do with the end-of-line and filtering (smudge and clean filter) processing that is done on blob objects representing files. (Symlinks and gitlinks need neither EOL hackery nor smudge filtering.) Git normally defers this processing to the point where the file is copied from the index to the work-tree. At this point, Git has the full path name of the file (because it's stored that way in the index²) and the hash ID. It looks up the appropriate EOL or filtering in the appropriate .gitattributes file(s), in the work-tree and/or index and/or globally. Work-tree files, if present, override index-only files, and attribute files "more local" to the directory holding the file override those higher up in the directory hierarchy. All of this is much easier to achieve if Git has the entire index and work-tree in place as it does this: it can find the correct EOL and filter attributes easily, and apply them to the blob contents during the copy from the index-stored hash to the work-tree location determined by the index-stored path name.

The upshot of all of this is that to extract files "the easy way", Git needs an index, which (for the duration of the command it's running, at least) acts as the index. But if you have one particular file whose path name you know in advance, and are willing to take some risk with EOL conversion and filtering (or forgo them entirely), you can use git cat-file or git show to extract the blob contents:

git cat-file (-p | --textconv | --filters) $commithash:$fullpath

for instance. (These three options are mutually exclusive modes: -p prints the blob contents as stored, while --textconv and --filters print converted contents.) When using --textconv or --filters, you must provide a path, so if all you have is a raw hash you must use:

git cat-file $filteropt --path=$path $rawhash

(where $filteropt is one of the above --textconv or --filters options).

If you want the contents unfiltered, none of the above caveats apply at all. Use -p instead of --textconv or --filters, and now git cat-file does not need a path name at all. Anything acceptable to git rev-parse that locates a blob object suffices, and:

git cat-file -p $hash > $path

suffices to extract the raw blob contents, writing them to $path.
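The same raw extraction can be sketched from Python, which sidesteps shell redirection and keeps the contents as bytes. This assumes git is on PATH; write_blob and the temporary repository are hypothetical names and setup for the demonstration only.

```python
import os
import subprocess
import tempfile


def write_blob(repo, blob_hash, dest_path):
    """Write the raw, unfiltered contents of a blob to dest_path."""
    raw = subprocess.run(
        ["git", "cat-file", "-p", blob_hash],
        cwd=repo, check=True, capture_output=True,
    ).stdout  # bytes as stored: no EOL conversion, no smudge filter
    with open(dest_path, "wb") as out:
        out.write(raw)


# Throwaway repository with a single blob, as in the question.
repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q"], cwd=repo, check=True)
blob = subprocess.run(
    ["git", "hash-object", "-w", "--stdin"],
    cwd=repo, input=b"f1 content\n", check=True, capture_output=True,
).stdout.decode().strip()

dest = os.path.join(repo, "f1.txt")
write_blob(repo, blob, dest)
```

Because no index is involved, nothing here records the file's mode or path; the caller has to supply the destination path and set any executable bit separately.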

¹The repository object's type is implied by the mode and is later matched against the underlying repository object's actual type. If we ignore symlinks and gitlinks, there are just two blob/file modes (100644, 100755) and one sub-tree mode (40000). A symlink (mode 120000) is also represented by a blob object, while a gitlink (mode 160000) records a submodule commit hash rather than a blob; both are leaves. So if the mode is 40000 we recurse and fetch another tree object; otherwise we have hit a leaf and write the hash, which for anything but a gitlink had best represent a blob, into the cache.

²Path names in the index get compressed, so this is not entirely true. There are several index formats, though, so it's particularly complicated. It's best to think of each index/cache entry as representing a (full path name, mode, hash) tuple.
