user3834142

Reputation: 561

Create new git repository inside subfolder by git init and ignore in the main git repo

I have tried to understand git submodule. It seems complicated (I don't have any experience with submodules). It might be worth investing some time in it. But right now I have some questions related to git submodule.

Actually what I am trying is:

Is this the right way to create a sub-repository (or possibly right direction)?

Is git submodule kind of the same as I did like above?

Thanks in advance.

Upvotes: 4

Views: 2235

Answers (1)

torek

Reputation: 489748

Git does not define "sub-repository", so you can define it however you like.

Git does define what it means to list a directory in a .gitignore, and this particular pattern works fine. What you have done is set up two repositories that are not related to each other at all.

In a typical setup, these two repositories would live side-by-side:

$HOME/
      [various dot files, personal files, etc]
      src/
          project-A/
                    .git/
                         [all of Git's files]
                    .gitignore
                    README.md
                    [various files]
          project-B/
                    .git/
                         [all of Git's files]
                    .gitignore
                    README.md
                    [various files]

All you have done is move project-B inside project-A's work-tree. Remember that any standard repository has these three parts:

  • The .git repository itself, full of files that Git uses for Git's own purposes. This contains the primary Git repository databases, one of which holds all the commits and other Git objects. It has all the files that Git needs to get Git's own things done. (You can look at these files any time, but in general, you shouldn't edit them directly unless you know what you're doing. But some are intended to be obvious, in which case, direct editing tends to work fine too. You might want to view the file .git/config, for instance—it has a pretty simple format, resembling Windows INI files. Looking inside the file .git/HEAD is also instructive.)

  • The work-tree, which is where you do your actual work. This is an ordinary directory tree, containing ordinary files. Git will fill it in with the files taken out of a commit, so that you can work on those files. You can also store files in here that Git knows nothing about: these files are called untracked files. Git will complain about them, unless you list them—or their name patterns—in a .gitignore file, whose primary function is to make Git stop complaining about them (and make sure you don't accidentally put them into the index, for which, see the next point).

  • The index, which is where Git keeps the next commit you're going to make. The index has a copy of every file un-frozen from the current commit, ready to go into the next commit. But these copies are in a special, Git-only format, like the files inside commits. They're not ordinary files like the ones in your work-tree.
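If you'd like to peek at the files mentioned in the first point without risking a real repository, here is a minimal sketch. The scratch directory from mktemp is just a placeholder of mine; nothing here is specific to your project.

```shell
# Sketch: create a throwaway repository and look at two of Git's own files.
set -e
repo=$(mktemp -d)          # hypothetical scratch directory
git init --quiet "$repo"

cat "$repo/.git/HEAD"      # a one-line file naming the current branch
cat "$repo/.git/config"    # INI-like settings for this repository
```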

After you work on a file in your work-tree, you can, at any point, use git add to re-copy that file from the work-tree back into the index. This turns it back into the special Git-only format, so that it is ready to go into the next commit. If the file was not in the index before, this copies it into the index (turning it into the Git-only format), so now it is in the index.

The presence of a file in the index is what makes the file tracked, so a file that's in the work-tree, but not in the index, is an untracked file. Git will complain about it, as we just said, unless you tell Git "don't complain about this file" via a .gitignore entry.

You can run git add with an option (like -A) or an argument (like .) that says "add everything" or "add many files". In this case, git add checks every currently-untracked file (any file that's not in the index right now) against the .gitignore settings too, and won't add a file that matches, so that it stays untracked.

Thus, what .gitignore means is not "ignore this file" but rather "if this file is untracked, don't complain about it being untracked, and don't automatically copy it into the index when I en-masse add or update lots of files to the index so that they're ready to go into the next commit".
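To see that behaviour for yourself, here is a small sketch in a throwaway repository. The file names (notes.txt, README.md) are made up for the demonstration.

```shell
# Sketch: an ignored file is skipped by a bulk add, but can still be tracked.
set -e
repo=$(mktemp -d)                 # hypothetical scratch directory
cd "$repo"
git init --quiet
printf 'notes.txt\n' > .gitignore # ignore pattern for one file
echo 'scratch' > notes.txt
echo 'hello' > README.md

git status --short                # notes.txt is not reported as untracked
git add .                         # bulk add skips the ignored file
after_bulk=$(git ls-files)        # paths now in the index
echo "$after_bulk"

git add --force notes.txt         # an explicit, forced add still works
after_force=$(git ls-files)       # notes.txt is now tracked despite .gitignore
echo "$after_force"
```

Once notes.txt is in the index, the .gitignore entry no longer has any effect on it: the file is tracked, and Git will notice changes to it like any other tracked file.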

The index itself is actually a file, or sometimes more than one file, inside .git, but it's worth listing separately for two reasons. The first is that it's so important and so in-your-face, even though you can't see it. The second is that Git now supports git worktree add to create additional work-trees. Each work-tree you add has its own index. That is, the index and work-tree are paired: there's only one repository, and with that one repository, you get one index-and-work-tree for free. You can then git worktree add a second index-and-work-tree, then a third index-and-work-tree, and so on. Obviously, the index is at least logically different from the repository itself, then: it associates with the work-tree, not with the repository.
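The pairing of index and work-tree can be checked with git rev-parse --git-path, which reports where a given work-tree keeps a particular Git file. This is only a sketch: the directory names, the branch name scratch, and the demo identity are placeholders.

```shell
# Sketch: one repository, two work-trees, two separate index files.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com       # placeholder identity
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
base=$(mktemp -d)                                # hypothetical scratch directory
git init --quiet "$base/main"
cd "$base/main"
echo hello > README.md
git add README.md
git commit --quiet -m 'initial commit'

# Add a second work-tree, on a new branch, next to the first one.
git worktree add -b scratch "$base/second"

# Ask each work-tree which index file it uses.
main_index=$(git rev-parse --git-path index)
second_index=$(git -C "$base/second" rev-parse --git-path index)
echo "main work-tree index:   $main_index"
echo "second work-tree index: $second_index"
```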

Anyway, the upshot of the above is that by putting project-B inside project-A, all you did was have project-A have tons of ignored files:

$HOME/
      [various dot files, personal files, etc]
      src/
          project-A/
                    .git/
                         [all of Git's files]
                    .gitignore                           <-- lists `/project-B/`
                    project-B/
                              .git/
                                   [all of Git's files]  <-- these files are untracked in A
                              .gitignore                 <-- and this is untracked in A
                              README.md                  <-- and so is this
                              [various files]            <-- and all of these
                    README.md
                    [various files]
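The layout above can be reproduced in a scratch directory to confirm that the outer repository sees nothing of the inner one. This is a sketch under my own assumptions: the directory names mirror the diagram, and the README contents are invented.

```shell
# Sketch: an independent repository nested in, and ignored by, another one.
set -e
a=$(mktemp -d)/project-A             # hypothetical location for project A
mkdir -p "$a"
cd "$a"
git init --quiet
printf '/project-B/\n' > .gitignore  # A ignores the whole nested directory
echo 'project A' > README.md

git init --quiet project-B           # B is a separate repository inside A's work-tree
echo 'project B' > project-B/README.md

outer_status=$(git status --short)   # A's view: no mention of project-B
echo "$outer_status"
git status --short --ignored         # additionally lists ignored paths

inner_top=$(git -C project-B rev-parse --show-toplevel)  # B's own work-tree root
echo "$inner_top"
```

Commands run inside project-B talk to B's repository only, and commands run in project-A never touch B's files, which is all that "not related to each other" means here.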

Is git submodule kind of the same as I did like above?

Not really, no. It's substantially more complicated. It does, though, produce a similar work-tree structure.

When using submodules, you actually link two repositories. This linkage is essentially one-way, though.

Generally, you first create two separate, totally independent repositories, and populate them with various commits. This is like the side-by-side setup above, with projects A and B being next to each other, rather than nested, one inside the other. In fact, very often, you don't create project B at all: someone else has already created project B as, e.g., a fancy library that you'd like to use to get some work done in your new project A. Let's use that as an example, since it's not only more common, but also the way git submodule kind of expects to be used. If you're making project B yourself, you can just start your first stab at project B, set it up on GitHub or wherever you're going to keep it for general accessibility, and get that all out of the way before you start making project A.

So, at this point, you have a project B, which (let's assume) has its main, pride-of-place, public-access repository stored on GitHub at one of the URLs git://github.com/someuser/B, ssh://git@github.com/someuser/B, or https://github.com/someuser/B.

You're now going to create project A. You can use the GitHub "create repository" clicky buttons to create it there first, then clone it to your laptop or wherever it is that you work:

<click some buttons in browser>

git clone <url> A      # so now you have A/README.md and A/.git and so on
cd A

Or, you can create it as empty on GitHub, or not even create it at all on GitHub, if you like:

mkdir A
cd A
git init

Either way, you are now in your A/ directory, which has a .git/ sub-directory holding the repository databases, an index, and a work-tree. Inside this work-tree, you can create and edit various files, use git add to copy them to the index so that they go into the next commit, and then run git commit to make new commits that freeze the index's contents into a new snapshot.

Now you're ready to link the repository B, which is still an independent Git repository all on its own, into repository A. To do this, you pick one of the URLs at which the primary version of repository B is found. Your Git will run git clone to put a new clone of repository B in your work-tree, so you must also pick the path for project B—the directory it will go into, in your current work-tree. Let's go with ssh://github.com/someuser/B here, as the URL, and project-B as the directory:

git submodule add ssh://github.com/someuser/B project-B

Your Git now runs git clone to create project-B as a clone of ssh://github.com/someuser/B. This clone is an ordinary clone in almost every way. It has a .git, for instance.[1] We just call it a "submodule" because it's being used by another Git repository. It does not need to know or care about that; as far as clone B is concerned, it's just an ordinary clone.

Meanwhile, the fact that A is now using a clone of B turns A into what Git calls a superproject. In your A work-tree, the git submodule add command will create a file named .gitmodules, putting the URL and path for submodule B into this file. It will git add this file to A's index, ready for the next commit. And, last, it will git add to A's index a special entry of type gitlink, using git add project-B, which is the path for submodule B.

So, now the index in A has two all-new entries: one for .gitmodules, which—if you look at the work-tree copy of the file—now lists the submodule, and one for a "file"—really, a gitlink—named project-B. If you run git commit now in the work-tree for project A, you get a new commit in the repository database. This new commit has all the files that were already in the index (e.g., README.md and so on), plus this new file .gitmodules, plus this new gitlink thing.
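Here is a sketch of the same steps that runs entirely on one machine. Instead of the GitHub URL it uses a local directory as a stand-in for the published project B; that substitution, the demo identity, and the protocol.file.allow=always setting (which recent Git versions need before they will clone a submodule from a local path) are assumptions of the demonstration, not part of the normal workflow.

```shell
# Sketch: add B as a submodule of A and inspect what lands in A's index.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com       # placeholder identity
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
base=$(mktemp -d)                         # hypothetical scratch directory

# Stand-in for the published project B.
git init --quiet "$base/B"
echo 'library B' > "$base/B/README.md"
git -C "$base/B" add README.md
git -C "$base/B" commit --quiet -m 'initial commit in B'

# Project A, with one ordinary commit already made.
git init --quiet "$base/A"
cd "$base/A"
echo 'project A' > README.md
git add README.md
git commit --quiet -m 'initial commit in A'

# Link B into A at the path project-B.
git -c protocol.file.allow=always submodule add "$base/B" project-B

cat .gitmodules                           # records the submodule's path and URL
gitlink=$(git ls-files --stage project-B) # mode 160000 marks a gitlink entry
echo "$gitlink"
git status --short                        # .gitmodules and project-B are both staged

git commit --quiet -m 'add B as a submodule'
```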


[1] In older versions of Git, this .git in the project B submodule clone is an ordinary directory, holding the repository database for this clone, just like any ordinary Git repository work-tree with the database inside its .git. In modern Git, it is a file that tells Git where to find the .git repository database for the project B submodule clone, which Git secretes inside the .git directory for project A (i.e., in A/.git). This hiding-away of the submodule-B repository enables added worktrees for A to share the submodule-B repository, rather than just duplicating it.


The operation of a submodule

Remember, again, a submodule doesn't have to know it's a submodule: it's just a regular, ordinary, Git repository. If you now cd project-B to get into the work-tree for the project B clone and run git log and git status and various other Git commands, they all work and they all tell you what's going on in this clone of project B.

You can do work in here if you like! However, there's a hitch. The superproject Git—the one managing the work-tree for project A—has, by this point, commanded the submodule Git, here in the project-B directory work-tree, to go into detached HEAD mode. If you are not familiar with detached HEAD mode, you now need to learn about that. If you are familiar with it—or after you've gone off and learned about it and come back here—the one specific commit at which your submodule project-B work-tree has its HEAD detached is a hash ID that's recorded in the gitlink in the superproject.

In other words, when you're working in project A and you tell it to go manipulate the other Git repository in the project-B directory, the way the project A Git knows which commit to use is to look in the gitlink stored in the index for project A. Let's say, for illustration, that this is 0123456....

If you do go into the project-B directory, you're in the clone of B, and you can git checkout any other commit, or even git checkout a branch in B. That changes the detached HEAD, or even attaches it to a branch, so that now repository B has a different commit checked-out. So let's say you do that:

cd project-B
git checkout develop
... do some work ...
git add ... some files ...
git commit -m 'do work on B to make it more useful for A'

You can git push the new commits back to GitHub, since project B is after all a regular old repository. But now the HEAD commit in project-B (the work-tree directory) is no longer 0123456..., now it's, say, 8088dad.... If you climb back up to the project A work-tree and run git submodule status you'll see that the Git managing A says: hey, the submodule has moved away from the detached HEAD I wanted, it's not on 0123456... any more!

That's true, but if that's what you want, it's now time to use git add to update the gitlink entry in the index for project A:

git add project-B

for instance. Now the index associated with project A's work-tree calls for commit 8088dad... in the submodule, and if you run git commit in the project A work-tree, you get a new commit for project A that says this:

git commit -m 'update submodule B to 8088dad...'

(This is not really the best commit message—it's better to say what features of submodule B you are using now, rather than just "I switched to commit 8088dad"—but this is an example and I don't even know what features you're using.)
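The whole dance can be tried end to end in scratch directories. As before, this is a sketch: a local directory stands in for the GitHub copy of B, the identity is a placeholder, and protocol.file.allow=always is only there so that recent Git versions accept a local path as a submodule source.

```shell
# Sketch: commit inside the submodule, then record the new commit in A.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com       # placeholder identity
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
base=$(mktemp -d)                        # hypothetical scratch directory
git init --quiet "$base/B"               # stand-in for the published project B
echo 'library B' > "$base/B/README.md"
git -C "$base/B" add README.md
git -C "$base/B" commit --quiet -m 'initial commit in B'

git init --quiet "$base/A"
cd "$base/A"
echo 'project A' > README.md
git add README.md
git -c protocol.file.allow=always submodule add "$base/B" project-B
git commit --quiet -m 'add B as a submodule'

# Do some work inside the submodule clone.
echo 'new feature' >> project-B/README.md
git -C project-B commit --quiet -am 'do work on B to make it more useful for A'

before=$(git submodule status)           # leading "+": B's HEAD differs from A's gitlink
echo "$before"

git add project-B                        # update the gitlink in A's index
git commit --quiet -m 'update submodule B'
after=$(git submodule status)            # leading space: gitlink and B's HEAD agree
echo "$after"
```

In real use you would also git push the new commit from inside project-B before pushing A, so that other clones of A can find the commit their gitlink names.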

There are other ways to do the dance of updating the submodule, then recording in a new commit in the superproject that the superproject Git should command the submodule Git to check out this particular commit. The git submodule command expresses many of these. But the point is, there are commits—many, over time—in the superproject repository, each of which says:

  • I use a submodule from URL <url>
  • stored at path <path>
  • and when I use it, I command the submodule Git to check out commit <hash-ID>.

The first two pieces of information are recorded, in the commits in project A, in a file named .gitmodules. Each commit has its own copy of that file (though as usual, if a million commits use the same version of the file, the repository has only one copy of that version). The last piece of information is recorded directly in the commits in project A: each one saves one raw hash ID, giving the commit that should be git checkout-ed in the submodule path.
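Both halves of that record can be read straight out of a commit. The sketch below builds a tiny superproject first (local stand-in for B, placeholder identity, and protocol.file.allow=always for recent Git versions are all assumptions of the demo), then inspects the commit with git show and git ls-tree.

```shell
# Sketch: read the submodule information stored in a superproject commit.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com       # placeholder identity
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
base=$(mktemp -d)                        # hypothetical scratch directory
git init --quiet "$base/B"               # stand-in for the published project B
echo 'library B' > "$base/B/README.md"
git -C "$base/B" add README.md
git -C "$base/B" commit --quiet -m 'initial commit in B'

git init --quiet "$base/A"
cd "$base/A"
echo 'project A' > README.md
git add README.md
git -c protocol.file.allow=always submodule add "$base/B" project-B
git commit --quiet -m 'add B as a submodule'

git show HEAD:.gitmodules                # URL and path, as stored in this commit
entry=$(git ls-tree HEAD project-B)      # mode 160000, type "commit", and the raw hash ID
echo "$entry"
```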

Summary

The purpose of git submodule is to allow you to specify, in your superproject Git, that you depend on some other Git repository. You record the URL of that repository that you want new clones to use to git clone the submodule. You record the path you want your superproject Git to use, to store the clone into. And, with each commit in the superproject, you record the specific submodule commit for that submodule, so that cloning the superproject, then checking out some specific, historical commit in that clone, also clones the submodule and checks out the correct historical commit in the submodule-clone.
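As a rough illustration of that last point, cloning with --recurse-submodules clones the superproject and then clones and checks out each submodule at the recorded commit. The sketch uses local directories in place of real URLs, a placeholder identity, and protocol.file.allow=always so that recent Git versions accept local paths for submodules; with real network URLs that setting would not be needed.

```shell
# Sketch: a fresh clone of A brings B along at the recorded commit.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com       # placeholder identity
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
base=$(mktemp -d)                        # hypothetical scratch directory
git init --quiet "$base/B"               # stand-in for the published project B
echo 'library B' > "$base/B/README.md"
git -C "$base/B" add README.md
git -C "$base/B" commit --quiet -m 'initial commit in B'

git init --quiet "$base/A"
cd "$base/A"
echo 'project A' > README.md
git add README.md
git -c protocol.file.allow=always submodule add "$base/B" project-B
git commit --quiet -m 'add B as a submodule'

# Clone the superproject and its submodule in one step.
git -c protocol.file.allow=always clone --recurse-submodules "$base/A" "$base/A-clone"

recorded=$(git -C "$base/A-clone" ls-tree HEAD project-B | awk '{print $3}')
checked_out=$(git -C "$base/A-clone/project-B" rev-parse HEAD)
echo "recorded in A:    $recorded"
echo "checked out in B: $checked_out"
```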

This means that the superproject Git now depends on the submodule: although the submodule clone is controlled via superproject Git commands (making the submodule a sort of slave, as it were), the superproject itself is no longer independent. It needs the help of that submodule Git, in order for the superproject to feel whole. And, because the slave-like submodule is a clone, this doesn't stop anyone who manages the original version of the submodule from doing anything they want with that repository. That even includes ripping commits away from the origin of the submodule clone, and if the origin's original commits have gone away, the hash IDs stored in the superproject's commits are now useless.

This doesn't mean don't do this, it just means know what sort of dependencies you're getting into when you do this. If you want your project-A to be independent of project-B's existence and exact commit hash IDs, don't use a submodule. If you are OK with your project-A being dependent on project-B, and want to make it convenient to use new features from project-B as they appear, git submodule is a good fit.

Upvotes: 5
