Olivier Cailloux

Reputation: 1056

Check whether anything new in the local git history compared to remote

I want to (programmatically) make sure that my local copy of a remote git history is accurate, that is, that the local git history contains exactly the same history as the remote git history (no more, no less).

A sure way of achieving this would be to delete my local copy and clone the remote again. But I’d like to save time and bandwidth when it’s relatively easy to do so (in the typical case, nothing will have changed since the last run of my program, so it would be unnecessary to download everything again at each run).

(I program in Java using JGit, but an answer that uses the git command line would be fine as well, I suppose, as it should be easy to translate to a Java program.)

I know how to fetch programmatically, and I have checked that it works in the simple case of a single master branch that tracks the origin/master branch. But I am not sure that a simple git fetch will necessarily fetch everything that is on the remote that I do not have locally, as I suppose it depends on the tracking status.

Conversely, I do not know how to check, in a simple way, that there is no more recent commit locally that has not yet been pushed to the remote.

I want to guarantee that the local history is the same as the remote one even in case someone has manually fiddled with the local git copy (for example, has configured odd tracking status), as far as is reasonably possible. I understand that it is impossible to guarantee anything if facing adversarial behavior, barring re-downloading everything and comparing everything (someone could maliciously change the origin/… pointers in the local copy, for example). I work under the hypothesis that my user is not maliciously trying to make my program crash or misbehave. I simply want to be able to warn my user if there is reason to believe that the local copy my program acts on seems to have been modified compared to the remote one, and offer an option to re-download (but not bother them if this appears unnecessary).

The reason for my question is that I want to inspect the local git history, for efficiency reasons, while making sure that the effect is as if I were reading directly from the remote data.

I do not care about anything being left over in the staging area or in the work tree, I only care about the git history (in the .git folder). And I do not care to write anything locally or remotely. This is only about reading data.

Upvotes: 0

Views: 175

Answers (1)

torek

Reputation: 487755

You'll have to ensure several constraints apply to the local repository:

  1. It must have only one remote, or you must only care about one remote. Otherwise you're requiring that the local repository exactly match two or more other Git repositories; if those two Git repositories don't match, it's impossible for the local Git repository to match both.

    (This constraint could be relaxed a bit if you're willing to also force all the other repositories to change so that they all match, if needed.)

  2. Your local repository must be a full clone of the remote. It must not be a shallow clone, nor a single-branch clone. (This condition is something you should probably just statically set up at the time you create the local clone. It will then persist unless someone deliberately breaks it. Still, you can check for shallow-ness by testing if the file $GIT_DIR/shallow exists. There should be a git rev-parse test for this, but in many versions of Git, there isn't. See if your Git version has git rev-parse --is-shallow-repository. You can test for single-branch-ness by testing the result of git config --get-all remote.origin.fetch: compare the result for a normal repository with that for a single-branch clone to see.)

  3. You must choose some method by which to identify a local branch name, such as master, with its corresponding branch name on the chosen remote. Typically, since users like to match up their branch names, this is trivial: if the remote is named origin, each (local) branch B corresponds to origin/B.

    If you choose some other method, modify the second step below accordingly.
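The checks in constraint 2 could be scripted roughly as follows. This is only a sketch: `is_full_clone` is a hypothetical helper name, and it assumes that a normal clone has exactly one fetch refspec of the default form `+refs/heads/*:refs/remotes/<remote>/*`. It tests for the `shallow` file directly, so it does not depend on `git rev-parse --is-shallow-repository` being available.

```shell
#!/bin/sh
# Sketch: report whether the repository in the current directory looks like
# a full (non-shallow, all-branches) clone of the given remote.
# is_full_clone is a hypothetical helper, not a Git command.
is_full_clone() {
    remote=${1:-origin}

    # Shallow check: a shallow repository has a file named "shallow"
    # in its Git directory.
    git_dir=$(git rev-parse --git-dir) || return 2
    if [ -f "$git_dir/shallow" ]; then
        echo "shallow clone"
        return 1
    fi

    # Single-branch check: assumed default refspec of a normal clone,
    # which fetches every branch of the remote.
    expected="+refs/heads/*:refs/remotes/$remote/*"
    actual=$(git config --get-all "remote.$remote.fetch")
    if [ "$actual" != "$expected" ]; then
        echo "unexpected fetch refspec: $actual"
        return 1
    fi

    echo "full clone"
}
```

Run it from inside the local repository, for example `is_full_clone origin`, and treat a non-zero status as a reason to warn the user or offer a fresh clone.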

Then, to check whether the two repositories match up—as of the time you do the checking; remember that either repository can be modified in the nanoseconds that tick away afterward—you just do these two steps. Remember to replace the name origin with whatever name you prefer, if needed.

  1. Run:

    git fetch origin
    

    so that all of the remote-tracking names are up to date.

  2. Run something equivalent to this (written as several sections with some commentary). Note: this is entirely untested.

    TF=$(mktemp) || exit
    trap 'rm -f "$TF"' 0 1 2 3 15  # clean up temp file on exit or signal
    valid=true
    

    The temporary file here is just because shell pipelines force subshells, which means variable settings will not propagate back to the main shell process. In other languages you will probably not have this problem.

    # compare all local branches to their updated remote-tracking counterparts
    git for-each-ref --format='%(refname:short) %(objectname)' refs/heads > $TF
    while read branchname hash; do
        theirs=$(git rev-parse -q --verify refs/remotes/origin/$branchname) || {
            # git rev-parse failed: they don't have this branch name at all
            valid=false
            break
        }
        test $hash = $theirs || { valid=false; break; }
    done < $TF
    if $valid; then
        echo "upstream repository has all the local branches and they match"
    else
        echo "upstream repository does not have some branch or does not match"
        exit 1 # failure
    fi
    

    You can, of course, check everything and print the entire missing or mismatched sets, rather than stopping early.

    # now make sure a local branch exists for each of its remote-tracking counterparts
    # Note: some or all of this is redundant, but it's easier to re-test here
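    git for-each-ref --format='%(refname) %(objectname)' refs/remotes/origin > $TF
    while read rtname hash; do
        # use the full refname so that the prefix strip below actually applies
        branchname=${rtname#refs/remotes/origin/}
        test "$branchname" = HEAD && continue  # origin/HEAD is symbolic, not a branch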
        ours=$(git rev-parse -q --verify refs/heads/$branchname) || {
            # git rev-parse failed: we don't have this branch name at all
            valid=false
            break
        }
        # no need to test hash: we did that already
    done < $TF
    if $valid; then
        echo "and, we have branches for all their branches"
        echo "so we must be good"
        exit 0
    else
        echo "we're missing some local branch names to match theirs"
        exit 1
    fi
    

    As before, you can do a full cross-check.
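One possible way to do the full cross-check in a single pass is to dump both sets of names with their hashes and let `diff` report every difference. This is a sketch rather than a tested replacement for the script above: `list_branch_mismatches` is a hypothetical helper, it assumes the remote name contains no characters that are special to `sed`, and it ignores the symbolic `origin/HEAD` entry.

```shell
#!/bin/sh
# Sketch: print every local branch / remote-tracking name that is missing
# on one side or points at a different commit. Lines starting with "<" are
# local, lines starting with ">" are remote-tracking.
# list_branch_mismatches is a hypothetical helper, not a Git command.
list_branch_mismatches() {
    remote=${1:-origin}
    ours=$(mktemp) || return 2
    theirs=$(mktemp) || { rm -f "$ours"; return 2; }

    # "name hash" for every local branch, sorted by name
    git for-each-ref --format='%(refname) %(objectname)' refs/heads |
        sed 's|^refs/heads/||' | sort > "$ours"

    # same for the remote-tracking names, dropping the symbolic HEAD entry
    git for-each-ref --format='%(refname) %(objectname)' "refs/remotes/$remote" |
        sed "s|^refs/remotes/$remote/||" | grep -v '^HEAD ' | sort > "$theirs"

    # diff exits 0 when both lists are identical, 1 when they differ
    diff "$ours" "$theirs"
    status=$?
    rm -f "$ours" "$theirs"
    return "$status"
}
```

Run it after `git fetch origin`. An exit status of 0 with no output means every branch exists on both sides with the same hash.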

It's probably reasonable to allow some situations: e.g., there's no requirement that some local branch B exist just because origin/B exists. So you might want to omit the last check entirely.

The test that local branch B matches origin/B is ridiculously simple: the hash IDs either match, or they don't. If they match, the history is the same. If not, it's not. The reason for this is that the history in a Git repository is the set of commits. A branch name simply contains the raw hash ID of the last commit that Git should consider to be "part of the branch". All earlier history is determined by the commits. Commits are entirely read-only, including their parent pointers; and each commit has a unique hash ID, so if the hash IDs match, so does the history.
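For a single branch, the hash comparison might be sketched as below. `compare_branch` is a hypothetical helper. The `git rev-list --left-right --count` call is an addition that is not needed for the equality test itself: it only reports how many commits each side has that the other lacks, which can make for a more useful warning message.

```shell
#!/bin/sh
# Sketch: compare local branch B with its remote-tracking name <remote>/B.
# compare_branch is a hypothetical helper, not a Git command.
compare_branch() {
    branch=$1
    remote=${2:-origin}

    # Resolve both names to raw commit hashes; fail if either is missing.
    ours=$(git rev-parse -q --verify "refs/heads/$branch") || return 2
    theirs=$(git rev-parse -q --verify "refs/remotes/$remote/$branch") || return 2

    # Same hash means same tip commit, hence the same history.
    if [ "$ours" = "$theirs" ]; then
        echo "identical"
        return 0
    fi

    # Optional detail: count commits reachable from only one side.
    # The output is "<left count> <right count>", split into $1 and $2.
    set -- $(git rev-list --left-right --count "$ours...$theirs")
    echo "differs: $1 local-only, $2 remote-only"
    return 1
}
```

A non-zero local-only count is the case the question asks about: commits that exist locally but have not been pushed to the remote.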

Upvotes: 1
