Organizing Code Into Git Submodules

Question

I would like to know whether Git submodules are an appropriate organization for some code that I currently keep under RCS, and if so, how the submodules should be organized.

General outline of modules

Suppose I have a collection of library modules (maybe libraries, maybe parts of a single library; that's one item up for discussion). Suppose some of those modules are base modules, and other modules depend on the base modules. All of these modules are intended to be used by yet other packaged software (programs), which would presumably include an appropriate selection of these modules as submodules.

To make it concrete, the library modules are:

stderr: standardized error reporting routines (not dependent on other modules).
filter: file filter programs (like grep or cat); uses stderr.
debug: debug trace support; uses stderr.
phasedtest: unit code testing; uses filter, debug and stderr directly.
rational: a rational number arithmetic package that uses phasedtest for its test code, but is independent of phasedtest and its dependencies otherwise.

Many other programs use stderr. Quite a lot of those also use filter (and all code that uses filter also uses stderr directly), but there are quite a lot of programs that use stderr without using filter. Some programs use debug; essentially all those programs also use stderr directly, but they may or may not use filter directly. Unit test programs using phasedtest may or may not use stderr, filter and debug directly (they're more likely to use stderr than the others), but phasedtest itself needs them, so such programs always use those modules indirectly. Some programs may use rational; usually they will use stderr too (nearly everything written by me uses stderr), but those programs don't, in general, use phasedtest directly.

Just for clarification: at the moment, these potential Git modules and submodules are not in Git at all; most of them have extensive (10-30 year) histories in RCS (SCCS prior to Y2K), which will be preserved when they are transitioned to Git. The intention is to get all the repos into GitHub in due course. In general, these modules are all fairly stable. They do get revised or extended, but not necessarily every year. Sometimes, three or more years go by without changes to some of them. I have a build/distribution system where the files that make up what might become submodules are pulled into the distribution of the larger program when that is being prepared for release. During normal (single-person) development, the material lives in a library with hundreds of source files built into a single (static) library (in $HOME/lib), and a single header directory ($HOME/inc, analogous to, but wholly separate from, either /usr/include or /usr/local/include).

I'm seeking to get the structure "right" — sufficiently right that I won't regret what I've done — before transitioning them to Git. I still have version stamping and tagging issues to resolve; that's a whole separate bag'o'worms and not part of this question.

How should submodules be organized?

From my understanding of submodules, it appears that:

stderr should be in its own repository.
filter should be in its own repository with stderr as a submodule.
debug should be in its own repository with stderr as a submodule.
phasedtest should be in its own repository with:
- debug as one submodule
- filter as one submodule
- but should it also include stderr as a direct submodule, or should it use the version of stderr from the nested submodules (the stderr inside debug and/or the stderr from inside filter)?
rational should be in its own repository with phasedtest as a submodule (and whatever sub-submodule organization comes with phasedtest).
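To make the nesting concrete, here is a rough sketch of that layout built from throwaway local repositories. The new_repo and add_sub helpers are hypothetical names invented for the illustration, the demo identity is a placeholder, and the protocol.file.allow override is there only because the sketch uses local paths in place of real remote URLs. It includes the optional direct stderr submodule in phasedtest.

```shell
#!/bin/sh
# Sketch only: mimic the proposed nested layout with throwaway local repos.
set -e
work=$(mktemp -d)
cd "$work"

# Placeholder identity for the demo commits.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: create repository $1 containing one commit.
new_repo() {
    git init -q "$1"
    echo "$1" > "$1/README"
    git -C "$1" add README
    git -C "$1" commit -q -m "Initial commit of $1"
}

# Hypothetical helper: add repository $2 as a submodule of repository $1.
# protocol.file.allow=always is needed only because these are local paths.
add_sub() {
    git -C "$1" -c protocol.file.allow=always \
        submodule --quiet add "$work/$2" "$2"
    git -C "$1" commit -q -m "Add $2 as a submodule"
}

for name in stderr filter debug phasedtest rational; do
    new_repo "$name"
done

add_sub filter stderr
add_sub debug stderr
add_sub phasedtest debug
add_sub phasedtest filter
add_sub phasedtest stderr      # the optional direct copy
add_sub rational phasedtest

# Populate every nested level inside rational.
git -C rational -c protocol.file.allow=always \
    submodule --quiet update --init --recursive

# List each checkout of stderr that rational now carries.
stderr_paths=$(git -C rational submodule status --recursive |
    awk '{ print $2 }' | grep '/stderr$')
echo "$stderr_paths"
stderr_count=$(echo "$stderr_paths" | wc -l | tr -d ' ')
echo "copies of stderr: $stderr_count"
```

With the direct submodule included, rational should end up carrying three separate checkouts of stderr (under phasedtest, phasedtest/debug and phasedtest/filter); dropping the direct one leaves two.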

Issues arising

Both filter and debug independently need the stderr submodule (but they're unlikely to depend significantly on any particular version of stderr; almost any working version at release level 10 will suffice). So, they both need a version of stderr in a submodule.
How many libraries should there be? Options include:
- Should there be three separate libraries: libstderr, libdebug, and libfilter?
- Or should libfilter include the material from stderr, and should libdebug include the material from stderr (two libraries)?
- Or should there be a single composite library libjlss with elements of stderr, debug and filter in it?
- Does the answer vary if the libraries are shared rather than static?
Should the phasedtest code be organized as a fourth library containing the modules stderr, filter and debug as submodules (so that stderr will appear three times, once as a direct dependency and twice as a dependency of debug and filter), or should it be a smaller library that requires linking with the three separate dependent libraries?
Since the rational module only requires phasedtest for testing, it won't install the phasedtest library or libraries. But it will need them available for testing. Should it require the pre-installed phasedtest library (libraries), or should it be self-contained and have the necessary code for testing as part of its distribution?
Programs using rational might also use stderr (probably would), but might or might not use debug and filter, and would be unlikely to use phasedtest except for unit testing their own components.

Main questions

Are Git submodules the right way to go, or should I be looking at an alternative organization?
Assuming that Git submodules are appropriate, how would the Git repositories be best organized?

Auxiliary questions

Is there a minimum sensible size for a repository?
Is there a maximum sensible number of submodules for a single repository?
Does it matter if a single submodule is a sub-submodule of a number of submodules used by a single repository?
Is there a conventional directory structure for submodules? All directories directly in the top-level directory, or in a standard named directory under the root directory, or in quasi-random locations in the superproject directory hierarchy?
Are there any glaring gotchas that I've not spotted?

larsks · Accepted Answer

Your first two questions ("are git submodules appropriate?" and "how should I organize them?") aren't really a good fit for Stack Overflow: the answers are going to be mostly matters of opinion, and it would be hard to identify any single answer as "correct".

Your auxiliary questions are slightly more addressable:

Is there a minimum sensible size for a repository?

Not really, no.

Is there a maximum sensible number of submodules for a single repository?

Again, no, but before creating a monster repository with hundreds of submodules, make sure you are comfortable working with them. People have different opinions on how best to manage submodules. Here is one person who has spent some time thinking about it. I don't agree with all his ideas, but it is at least a way to start thinking about the issue.

Does it matter if a single submodule is a sub-submodule of a number of submodules used by a single repository?

Not really, no, although if you have multiple instances of a repository scattered about your sources you are probably going to run into issues of version skew (e.g., one is at version A and another is at version B and another is at version C) unless you are very careful.
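One way to check for that kind of skew is to compare the commits reported by git submodule status --recursive. The sketch below is only an illustration: it builds throwaway local repositories in which filter and debug pin different commits of stderr, then groups the checkouts by directory name. The helper names and the awk grouping are my own assumptions, and grouping by directory name only works if each submodule keeps the same name wherever it is nested.

```shell
#!/bin/sh
# Sketch only: detect one submodule checked out at different commits.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: create repository $1 containing one commit.
new_repo() {
    git init -q "$1"
    echo "$1" > "$1/README"
    git -C "$1" add README
    git -C "$1" commit -q -m "Initial commit of $1"
}

# Hypothetical helper: add repository $2 as a submodule of repository $1.
# protocol.file.allow=always is needed only because these are local paths.
add_sub() {
    git -C "$1" -c protocol.file.allow=always \
        submodule --quiet add "$work/$2" "$2"
    git -C "$1" commit -q -m "Add $2 as a submodule"
}

for name in stderr filter debug phasedtest; do
    new_repo "$name"
done

add_sub filter stderr                  # filter pins stderr at its first commit
echo "change" >> stderr/README
git -C stderr commit -q -a -m "Second commit of stderr"
add_sub debug stderr                   # debug pins stderr at its second commit

add_sub phasedtest filter
add_sub phasedtest debug
git -C phasedtest -c protocol.file.allow=always \
    submodule --quiet update --init --recursive

# Group checkouts by directory name; report names seen at more than one commit.
skew=$(git -C phasedtest submodule status --recursive |
    awk '{
        sha = $1; sub(/^[-+U]/, "", sha)   # drop the status prefix, if any
        n = split($2, part, "/"); name = part[n]
        if (!((name, sha) in seen)) { seen[name, sha] = 1; commits[name]++ }
    }
    END { for (name in commits) if (commits[name] > 1) print name, commits[name] }')
echo "skewed submodules: ${skew:-none}"
```

Here the report should name stderr with two distinct commits, one from filter and one from debug.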

Is there a conventional directory structure for submodules? All directories directly in the top-level directory, or in a standard named directory under the root directory, or in quasi-random locations in the superproject directory hierarchy?

There is not, but typically you will pick something that works for you and stick with it. I have seen many projects that place submodules into a lib or modules directory, while others place them at the top level.

Are there any glaring gotchas that I've not spotted?

Remember that when a repository is checked out as a submodule, its HEAD is managed by the parent repository. That is, if you cd into a submodule, make changes, commit and push them, and then run git submodule update in the parent project without first recording the new commit there, you will roll back the local copy of your submodule to whatever commit is recorded in the parent.
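The rollback can be reproduced with a few throwaway repositories. This is a sketch under some assumptions: the repository names (stderr-src, stderr.git, parent) are made up, the upstream is a local bare repository so that it can accept a push, and the protocol.file.allow override is there only because of the local paths.

```shell
#!/bin/sh
# Sketch only: show git submodule update undoing a checkout made in a submodule.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream library, published as a bare repository so it can accept pushes.
git init -q stderr-src
echo "v1" > stderr-src/README
git -C stderr-src add README
git -C stderr-src commit -q -m "Initial commit of stderr"
git clone -q --bare stderr-src stderr.git

# Parent project that records stderr as a submodule.
git init -q parent
echo "parent" > parent/README
git -C parent add README
git -C parent commit -q -m "Initial commit of parent"
git -C parent -c protocol.file.allow=always \
    submodule --quiet add "$work/stderr.git" stderr
git -C parent commit -q -m "Add stderr as a submodule"
recorded=$(git -C parent/stderr rev-parse HEAD)

# Edit inside the submodule checkout, commit, and push upstream.
echo "v2" > parent/stderr/README
git -C parent/stderr commit -q -a -m "Change made inside the submodule"
git -C parent/stderr push -q origin HEAD
edited=$(git -C parent/stderr rev-parse HEAD)

# The parent still records the old commit, so update checks it out again.
git -C parent submodule --quiet update
current=$(git -C parent/stderr rev-parse HEAD)

echo "recorded in parent:     $recorded"
echo "committed in submodule: $edited"
echo "after submodule update: $current"
```

The pushed commit is not lost (it is still upstream and on the submodule's local branch), but the working copy is back at the recorded commit on a detached HEAD. Running git add stderr and git commit in the parent before the update would have recorded the new commit and avoided the rollback.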

It is for this reason that I generally treat submodules as read-only instances of a repository that only ever get updated by running git pull (followed by a subsequent commit in the parent repository). I only edit files in the standalone checkout of the repository.

You need to train yourself to regularly run git submodule update after pulling new changes into the parent repository (in case those changes included new versions of your submodules).
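Both habits can be sketched end to end: updating a submodule with git pull and recording it in the parent, then picking that change up in a second working copy. As before, this is an illustration with throwaway local repositories. The names parent and other and the new_repo helper are hypothetical, and the protocol.file.allow override is needed only because the submodule URL is a local path.

```shell
#!/bin/sh
# Sketch only: a parent pull leaves the submodule stale until it is updated.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: create repository $1 containing one commit.
new_repo() {
    git init -q "$1"
    echo "$1" > "$1/README"
    git -C "$1" add README
    git -C "$1" commit -q -m "Initial commit of $1"
}

new_repo stderr
new_repo parent
git -C parent -c protocol.file.allow=always \
    submodule --quiet add "$work/stderr" stderr
git -C parent commit -q -m "Add stderr as a submodule"

# A second working copy of the parent, for example on another machine.
git clone -q parent other
git -C other -c protocol.file.allow=always submodule --quiet update --init

# Upstream stderr gains a commit; the first copy pulls it and records it.
echo "fix" >> stderr/README
git -C stderr commit -q -a -m "Second commit of stderr"
git -C parent/stderr pull -q --ff-only
git -C parent add stderr
git -C parent commit -q -m "Update stderr submodule"
recorded=$(git -C parent rev-parse HEAD:stderr)

# Pulling the parent alone leaves the submodule checkout where it was ...
git -C other -c protocol.file.allow=always pull -q --ff-only
stale=$(git -C other/stderr rev-parse HEAD)

# ... until submodule update checks out the newly recorded commit.
git -C other -c protocol.file.allow=always submodule --quiet update
fresh=$(git -C other/stderr rev-parse HEAD)

echo "recorded in parent:      $recorded"
echo "after pull only:         $stale"
echo "after submodule update:  $fresh"
```

If remembering the extra step is a nuisance, reasonably recent versions of Git also offer git pull --recurse-submodules and the submodule.recurse configuration setting, which are meant to fold the update into the pull. Check the documentation for your Git version before relying on them.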