Alan P.

Reputation: 3123

Confused by git submodules

I have an app and have set it up to use Git. In that app I have a folder that I have set up as a Git submodule, something like this:

/app
  .git
  .gitmodules
  ...other-files
  /...other-folders
  /feature           # this is a sub-module
    .git
    ...other-files
    /...other-folders

Something seems wrong: when I make changes in the /feature folder and then run git status in the root folder, it lists the changes in the sub-module.

Also when I change the branch in the root folder, doing a git status in the sub-module lists lots of files. Same if I change the branch in the sub-module, git status lists lots of changed files in the root folder.

I thought of adding /feature to the root directory's .gitignore, but read that is the wrong approach?

Is this expected? Maybe I am misunderstanding submodules? The reason I added them is to give a contractor access to just a part of the codebase.

Upvotes: 1

Views: 2165

Answers (1)

torek

Reputation: 487735

All I can say for sure is that, based on the fact that running git submodule produced no output, plus the results from the git status commands, you are currently in a state in which your index and work-tree for your main repository are not using any submodules.

One thing not to worry about is this: all commits are frozen for all time (or at least for as long as the commit itself exists), so anything that has been committed, to either the superproject or the submodule, should be recoverable as long as the superproject and submodule Git repositories still exist. You can discard commits by abandoning them, but this is at least a little bit hard, and it normally takes at least 30 days for the abandonment to take effect. You have to go to special effort to lose commits forever (which is why getting rid of accidentally-committed large or sensitive files is so hard).
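As a rough illustration of that recoverability, here is a small sketch that abandons a commit and then gets it back through the reflog. The directory name demo, the file name, and the "Example" identity are all made up for the illustration, and the whole thing runs in a throwaway temporary directory.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

git init -q "$tmp/demo"
cd "$tmp/demo"
echo one > file.txt; git add file.txt; git commit -q -m "first"
echo two > file.txt; git add file.txt; git commit -q -m "second"
lost=$(git rev-parse HEAD)          # hash ID of the commit we are about to abandon

git reset -q --hard HEAD~1          # "second" is no longer on the branch
found=$(git rev-parse 'HEAD@{1}')   # but the reflog still remembers where HEAD was
git reset -q --hard "$found"        # put the branch back on the abandoned commit
restored=$(git rev-parse HEAD)
echo "lost=$lost restored=$restored"
```

The reflog entry is what keeps the abandoned commit alive for the grace period mentioned above.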


It's worth thinking, here, about how submodules work. A submodule itself is just a Git repository, and other than being used by some other Git repository, a submodule is largely unaware of its submodule-ness. Meanwhile, a superproject repository acts as a superproject, referring to and thus controlling some submodule, by a slightly complicated multi-step process.

Basics of a single Git repository, without any submodule fussing-about

Let's think first about just one repository at a time. Each repository has:

  • some set of commits, each commit being identified by a unique[1] hash ID;
  • some set of branch and tag names;
  • an index and a work-tree, if the repository is not one made with --bare.[2]

The work-tree is where you do all your work: it has your files in it, in their normal (not Git-ified) form, as ordinary read/write files. Git doesn't really use these files for much: they're just copied out into the work-tree via git checkout, or copied from it when you use git add. Git stores your data in its own internal, Git-ified form. Once committed, the file data, and the commits themselves, are frozen and can never be changed by anything.

The index is where you build the next commit you will make. It starts out holding copies[3] of all the files that Git extracted from the commit you checked out earlier. You can see and work with the work-tree copy. If you change the work-tree copy, you must then git add the file again. This re-Git-ifies the file's data, stuffing it into the index in ready-to-commit form. A git commit then freezes whatever is in the index into the new commit. The freezing process assigns the new commit its unique hash ID, and the new commit becomes the tip commit of the current branch.
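To make the index-versus-work-tree distinction concrete, here is a minimal sketch in which the work-tree copy is changed again after git add. The names demo and file.txt and the "Example" identity are invented for the example. The commit should pick up what was in the index, not what is in the work-tree.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

git init -q "$tmp/demo"
cd "$tmp/demo"
echo v1 > file.txt
git add file.txt
git commit -q -m "v1"

echo v2 > file.txt     # change the work-tree copy only
git add file.txt       # copy v2 into the index
echo v3 > file.txt     # change the work-tree again; the index still holds v2
git commit -q -m "v2"  # freezes the index contents, not the work-tree

committed=$(git show HEAD:file.txt)   # what the new commit holds
worktree=$(cat file.txt)              # what is sitting in the work-tree
echo "committed=$committed worktree=$worktree"
```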

Of course, we typically clone a Git repository from somewhere, rather than just creating one from scratch. (There are exceptions, but they're not interesting here.) To clone a repository, we run:

git clone <url> <path>

which goes to the given url, makes sure it can find a Git repository there, and then gets it and writes it out into a repository-and-work-tree that lives at the path argument. So you run:

git clone git://github.com/path/to/repo.git

which defaults the path part to repo, and wind up with repo/.git holding the repository proper, and repo/ holding the work-tree in which you'll be able to see and work with files.
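The same layout can be seen without a network connection. This sketch assumes a local bare repository standing in for the remote one; the directory names src, remote, and work are made up, and the local path replaces the URL.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Build something to clone, then publish it as a bare repository named repo.git.
git init -q "$tmp/src"
cd "$tmp/src"
echo hello > README
git add README
git commit -q -m "initial"
mkdir "$tmp/remote" "$tmp/work"
git clone -q --bare "$tmp/src" "$tmp/remote/repo.git"

# Clone without giving a path: the directory name defaults to "repo".
cd "$tmp/work"
git clone -q "$tmp/remote/repo.git"
ls -a repo                                                  # lists .git and README
is_bare=$(git -C "$tmp/remote/repo.git" rev-parse --is-bare-repository)
echo "remote is bare: $is_bare"
```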


[1] Technically, the hash IDs for objects in any given repository need only be unique within that repository and any and all of its clones, so that clone-repositories repo-A and repo-B can tell whether they have the same object just by comparing hash IDs of objects in repo-A with those of objects in repo-B. In practice, however, these hash IDs are unique across all repositories, even if they never cross paths with one another.

[2] A bare repository really has just one purpose: to be a repository to which people can git push. A bare repository still has an index, just because that's how Git rolls (this index often goes unused), but it has no work-tree, so no one can do any work in it. That makes it suitable for receiving git push operations, which, in a non-bare repository, might interfere with someone's ongoing work. With no place to do any ongoing work, there is never anything to interfere with.

[3] What's in the index is actually just a reference to the file data, plus the name under which the data is to be stored. That name is the full name: Git doesn't bother with "folders" or "directories" here, and a file is just named path/to/file.ext. The fact that the index does not hold directories is the primary reason that Git cannot store an empty directory. When you add a new file to the index, or even put an old copy back into place, if the data that go into that file are already stored as a Git object, Git just re-uses the existing object. All objects are frozen for all time, so it's quite safe to re-use an existing one. If the file data are all-new, Git makes a new object, freezing it in the process, and now the index can refer to the new object.

Sometimes, this process results in unused / unneeded objects. Git has a cleaner, the Garbage Collector or gc, that sweeps them up. Git invokes it automatically whenever that looks profitable, so that you normally need not think about it at all.
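A quick way to see both points from this footnote (full path names in the index, and re-use of an existing object for identical data) is sketched here. The file names are invented, and the two files deliberately have the same content.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment

git init -q "$tmp/demo"
cd "$tmp/demo"
echo "same content" > a.txt
mkdir -p path/to
cp a.txt path/to/file.ext          # second file, identical data
git add a.txt path/to/file.ext

# Each index entry is: mode, object hash ID, stage number, full path name.
git ls-files --stage

blob_a=$(git ls-files --stage a.txt | awk '{print $2}')
blob_b=$(git ls-files --stage path/to/file.ext | awk '{print $2}')
echo "a.txt -> $blob_a"
echo "path/to/file.ext -> $blob_b"
```

Both entries should name the same object, since the data are the same.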


Adding submodules into the mix

Let's start by pointing out, again, that a submodule is just another Git repository. Suppose you have your existing repository in ./repo/.git as made by your git clone git://github.com/path/to/repo.git. You then:

cd repo

and it has no submodule/ directory in it, but you'd like to have a second Git repository here, at repo/submodule/, which will have a .git, etc. The submodule repository itself is at git://github.com/public/utils.git, so you'll need to run—or have Git run:

git clone git://github.com/public/utils.git submodule

while in your repo directory.

To have Git run this for you, you need to tell Git:

  • The URL for the submodule I want is spelled git://github.com/public/utils.git.
  • The path within my work-tree should be submodule.

This information, which is what git clone will need, goes into the .gitmodules file. This is an ordinary file, so it behaves according to the ordinary rules of commits: there's a copy in every commit. From .gitmodules, though, the URL gets copied to .git/config (git submodule init does the copying). Once it is copied to .git/config, Git stops using the URL in .gitmodules. Changing .gitmodules and committing thus affects future clones, but not your own, where the setting now lives in .git/config. This is another source of weirdness, but sometimes you may need to fuss with a URL, e.g., to switch from https:// to ssh://, so it can be useful.
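Here is a sketch of the two copies of the URL, using a local repository as a stand-in for utils.git (the local path, the "Example" identity, and the ssh://git@example.com/... URL are assumptions made up for the illustration). Note that git submodule add both writes .gitmodules and registers the URL in .git/config, and that git submodule sync is the command that re-copies an edited URL.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Local stand-in for the utils repository.
git init -q "$tmp/utils"
git -C "$tmp/utils" commit -q --allow-empty -m "utils: initial commit"

git init -q "$tmp/repo"
cd "$tmp/repo"
git commit -q --allow-empty -m "superproject: initial commit"

# Newer Git versions refuse local-path submodule clones unless this is set.
git -c protocol.file.allow=always submodule --quiet add "$tmp/utils" submodule
git commit -q -m "add submodule"

url_in_gitmodules=$(git config -f .gitmodules --get submodule.submodule.url)
url_in_config=$(git config --get submodule.submodule.url)

# Edit only .gitmodules: this clone's .git/config does not follow along.
new_url="ssh://git@example.com/public/utils.git"   # hypothetical URL
git config -f .gitmodules submodule.submodule.url "$new_url"
url_after_edit=$(git config --get submodule.submodule.url)

# git submodule sync copies the edited URL into .git/config.
git submodule --quiet sync
url_after_sync=$(git config --get submodule.submodule.url)
echo "after edit: $url_after_edit"
echo "after sync: $url_after_sync"
```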

But the URL and path are not quite enough, because git://github.com/public/utils.git is constantly evolving. The next commit on their master branch, or whatever, might not be the commit that works with your repository. The commit you've tested, that works, is, say, a123456.... So you need to tell your Git:

  • After you've cloned that repository, have the submodule Git check out commit a123456....

The way Git submodules are implemented, your superproject must specify the exact commit, by hash ID.[4] The place where this hash ID is stored is, however, not the .gitmodules file! (If it were, things might be easier.)

Instead, the particular submodule commit (the raw hash ID in the other Git repository) that your superproject should git checkout is stored in each superproject commit, as a special entry that Git calls a gitlink.[5] Each commit that has a gitlink in it tells git submodule that it should run the appropriate git clone and git checkout commands if needed, to fill in the work-tree. These gitlinks are entries in your index, too, just like the entries for regular files.
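You can look at a gitlink directly with git ls-tree. The sketch below again uses a local stand-in for utils.git (the paths and the "Example" identity are made up) and compares the hash ID recorded in the superproject commit with the commit actually checked out in the submodule.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Local stand-in for the utils repository.
git init -q "$tmp/utils"
git -C "$tmp/utils" commit -q --allow-empty -m "utils: initial commit"

git init -q "$tmp/repo"
cd "$tmp/repo"
git commit -q --allow-empty -m "superproject: initial commit"
# Newer Git versions refuse local-path submodule clones unless this is set.
git -c protocol.file.allow=always submodule --quiet add "$tmp/utils" submodule
git commit -q -m "add submodule"

# The tree entry for the submodule path: mode, type, hash ID, name.
git ls-tree HEAD submodule
mode=$(git ls-tree HEAD submodule | awk '{print $1}')
type=$(git ls-tree HEAD submodule | awk '{print $2}')
recorded=$(git ls-tree HEAD submodule | awk '{print $3}')

# The commit that the submodule repository itself has checked out.
checked_out=$(git -C submodule rev-parse HEAD)
echo "mode=$mode type=$type recorded=$recorded checked_out=$checked_out"
```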

If you need to change the commit that the superproject will check out, you first enter your submodule. It's an ordinary Git repository, after all. Then you do stuff there and make new commits as usual. Now the commit that's checked out in the submodule is, say, b789abc... rather than a123456....

To record this new hash ID, you then go back to your superproject and use git add:

git add submodule

This tells your superproject Git to update the submodule repo commit hash ID stored in your index—your superproject asks the submodule "which commit is the current commit" and it says b789abc..., for instance. So now that your index is updated, your next commit will be updated too. The new submodule hash ID will be in your next git commit.
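Put together, the update cycle might look like the following sketch. The local stand-in for utils.git, the file name helper.txt, and the "Example" identity are assumptions for the illustration; the point is only that the recorded hash ID changes when you git add the submodule path and commit.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Local stand-in for the utils repository.
git init -q "$tmp/utils"
git -C "$tmp/utils" commit -q --allow-empty -m "utils: initial commit"

git init -q "$tmp/repo"
cd "$tmp/repo"
git commit -q --allow-empty -m "superproject: initial commit"
# Newer Git versions refuse local-path submodule clones unless this is set.
git -c protocol.file.allow=always submodule --quiet add "$tmp/utils" submodule
git commit -q -m "add submodule"
before=$(git ls-tree HEAD submodule | awk '{print $3}')

# Enter the submodule and make a new commit there, as in any repository.
cd submodule
echo "new helper" > helper.txt
git add helper.txt
git commit -q -m "utils: add helper"
sub_head=$(git rev-parse HEAD)
cd ..

git status --short        # the superproject notices the submodule moved
git add submodule         # record the submodule's current commit in the index
git commit -q -m "use newer utils commit"
after=$(git ls-tree HEAD submodule | awk '{print $3}')
echo "before=$before after=$after"
```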


[4] You can specify a branch or tag name in your .gitmodules entry, which is then copied to your .git/config as usual. But this branch or tag name is only used for particular git submodule update commands. Essentially, it's there to allow you to have your superproject git checkout some other commit for testing purposes. You then test it, and if it all goes well, git add the submodule path to update the gitlink in your index, and git commit to record the new raw hash ID. So these branch-or-tag-names are semi-useless: they don't mean what people expect them to mean, at first.

[5] A gitlink is in fact a tree object entry whose mode field is set to 160000, which is essentially a symlink (mode 120000) OR-ed with a tree (mode 040000). The commit refers to a top level tree, and the top level tree refers to more trees and/or to blob objects, symlinks, and/or gitlinks. If Git were able to store empty directories, it would do so using a tree object, but tree entries are not allowed in the index. When you get close to having Git put one in, Git winds up writing a gitlink entry in the index instead. This is why attempts to store empty directories wind up storing broken submodules.
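The mode arithmetic can be checked with plain shell arithmetic, where a leading zero marks an octal constant:

```shell
# symlink mode OR-ed with tree mode, printed back in octal
mode=$(printf '%o' "$((0120000 | 0040000))")
echo "$mode"   # prints 160000
```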


Complications

If you add a submodule to a repository that does not have a submodule in any of the existing, earlier commits, those existing commits continue to not have a submodule in them, because they lack this new gitlink entry. If those existing commits have files in them whose names start with the submodule path, and you git checkout one of them, those files get extracted into the submodule directory, and are now ordinary tracked files: they're in the index and in your work-tree, which now extends into the paths that the other Git thinks it's controlling (but your superproject is now controlling directly, rather than asking the other Git to control).

If you git checkout a commit that has the submodule, it won't have any files whose path names start with (in this example) submodule/. So those files will be removed from the (superproject) index and (superproject) work-tree. If the submodule Git exists, and is trying to keep track of those files, the submodule Git will still have those files in its index, but since its work-tree and your superproject work-tree have this sort of overlap, you'll now (after moving back into the submodule) see those work-tree files as having gone missing.

Note that if environment variable $GIT_DIR is set, or you use --git-dir to set it, operations done anywhere in your file system may refer to the .git Git repository in your repo/ rather than the .git repository in your repo/submodule/. This of course depends on exactly what argument you pass to --git-dir, or what string you put in $GIT_DIR. Your own shell-level $GIT_DIR is normally not set, except when running Git hooks. Internally, it is set for each Git command: the git front end sets it by finding the Git directory, or using the existing $GIT_DIR or your --git-dir argument. The git submodule command knows how to set it correctly for each submodule, of course.
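To see the effect of $GIT_DIR and --git-dir, here is a small sketch with two unrelated repositories, a and b (both names, the commit messages, and the "Example" identity are invented). Commands run from inside b answer for a once the Git directory is overridden.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

git init -q "$tmp/a"
git init -q "$tmp/b"
git -C "$tmp/a" commit -q --allow-empty -m "commit in a"
git -C "$tmp/b" commit -q --allow-empty -m "commit in b"

cd "$tmp/b"
here=$(git log -1 --format=%s)                                # found by searching upward from b
via_env=$(GIT_DIR="$tmp/a/.git" git log -1 --format=%s)       # environment variable override
via_flag=$(git --git-dir="$tmp/a/.git" log -1 --format=%s)    # command-line override
echo "here=[$here] via_env=[$via_env] via_flag=[$via_flag]"
```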

It's kind of painful mixing old commits that have both repo/file and repo/submodule/file2 with new commits in which repo/file still exists but repo/submodule is a submodule, and only the submodule itself is supposed to fuss with repo/submodule/file2. Every time you use the superproject Git and git checkout any old commit, the superproject Git tries to "take over" the submodule work-tree (or parts of it). Every time you switch back from such a commit to one in which the submodule is in use, the submodule itself—or at least, its work-tree—has been damaged (well, has potentially been damaged, depending on overlap and whether the submodule has various untracked-and-ignored paths in it).
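One way to find out which of your commits are of which kind is to ask what each commit stores at the submodule path. The sketch below builds such a mixed history on purpose (the local stand-in for the submodule repository, the file contents, and the "Example" identity are made up) and only inspects the commits; it deliberately does not check the old one out.

```shell
set -eu
tmp=$(mktemp -d)   # throwaway directory for the experiment
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

# Local stand-in for the repository that will become the submodule.
git init -q "$tmp/utils"
git -C "$tmp/utils" commit -q --allow-empty -m "utils: initial commit"

git init -q "$tmp/repo"
cd "$tmp/repo"

# Old-style commit: submodule/file2 is an ordinary tracked file.
mkdir submodule
echo data > submodule/file2
git add submodule/file2
git commit -q -m "old: plain files under submodule/"
old=$(git rev-parse HEAD)

# New-style commit: the same path is now a gitlink.
git rm -r -q submodule
# Newer Git versions refuse local-path submodule clones unless this is set.
git -c protocol.file.allow=always submodule --quiet add "$tmp/utils" submodule
git commit -q -m "new: submodule/ is a submodule"
new=$(git rev-parse HEAD)

old_type=$(git ls-tree "$old" submodule | awk '{print $2}')   # an ordinary directory
new_type=$(git ls-tree "$new" submodule | awk '{print $2}')   # a gitlink
echo "old commit stores a $old_type, new commit stores a $new_type"
```

A commit that reports tree at that path is one where the superproject will try to manage the files itself.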

Upvotes: 5
