jan.sende

Reputation: 860

How can I clang-format my WHOLE git history?

I have now finished a small library of mine. When I started working on it, I didn't know about clang-format. Now I would like to format the whole repository with it. I know that this breaks other people's repositories because the commit hashes change. However, as no one is using my library yet, this is fine with me.

Thus, what do I have to do to run clang-format for every commit in my history?

Upvotes: 5

Views: 2545

Answers (2)

torek

Reputation: 487725

Git comes with a git filter-branch command, a tool that helps with this kind of task. Note that git filter-branch itself does not do the job: it is merely a tool you can use to do the job. You must still write your own commands. The one you'll probably use in the end is:

git filter-branch --tree-filter '<some command here>' --tag-name-filter cat -- --all

What filter-branch does

There's a basic problem here: no commit, once made, can ever be altered in any way. Nothing about the commit can change: not the name of the person who made it, not the date-and-time stamps, not the snapshot, and not the raw hash ID of its parent commit(s). So git filter-branch does not do that.

What it does instead is extract each commit (from some set of commits—in your case, you want this set to be all commits), one at a time, then run some arbitrary, user-specified command(s) on the extracted commit. Whatever this does, filter-branch then makes a new commit from the result.

If the new commit is exactly, totally, 100% bit-for-bit identical to the original commit, this actually re-uses the original commit. Otherwise, it makes a new commit with a new and different hash ID.

Once you've made a new and different commit, each subsequent commit will generally be at least slightly different: it will have a different parent. The filter-branch tool takes care of this reparenting process for you. So the two hard jobs it does are:

  • extracting each commit, running the filter(s), and recommitting
  • updating parent linkage as appropriate

The remaining hard job is of course writing and running the filter(s). That one, filter-branch leaves to you.
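
The hash behavior described above can be checked without filter-branch at all. The sketch below builds commit objects by hand with git commit-tree in a throwaway repository. The identity, the fixed timestamp, and the variable names are all made up for illustration.

```shell
# Throwaway repository; nothing here touches a real project.
repo=$(mktemp -d) && cd "$repo" && git init -q

# Pin every metadata field so the only inputs that vary are the ones we change.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com'
export GIT_AUTHOR_DATE='1577836800 +0000' GIT_COMMITTER_DATE='1577836800 +0000'

tree=$(git write-tree)   # the empty tree; good enough as a snapshot

# Same snapshot, same metadata, same message: same hash ID both times.
first=$(echo 'message' | git commit-tree "$tree")
again=$(echo 'message' | git commit-tree "$tree")

# Change anything (here, the message) and the hash ID changes.
other=$(echo 'different message' | git commit-tree "$tree")

# Otherwise identical children of different parents get different hash IDs,
# which is why every commit after the first rewritten one must be copied too.
child_of_first=$(echo 'child' | git commit-tree "$tree" -p "$first")
child_of_other=$(echo 'child' | git commit-tree "$tree" -p "$other")

echo "$first $again $other"
echo "$child_of_first $child_of_other"
```

The first two IDs printed should be equal, and every other pair should differ.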

The --tree-filter is probably the easiest filter to use, and is therefore the one you want. It's worth noting in passing that --index-filter is much faster—but it's much harder to work with, if your job is to modify the snapshot in each commit in some way. Filter-branch has a lot of filtering options because --tree-filter is the slowest filter and because it's only good for changing the snapshots. The --msg-filter can edit or replace the message text in each commit, for instance. As long as you want to run clang-format over all the files in each snapshot, though, stick with --tree-filter.

How the command line part works, in more detail

Let's take a brief look at how this works in practice, starting with an example in which there are just three commits. These three commits have big ugly hash IDs but we'll call them A, B, and C for simplicity. You start with:

A <-B <-C   <-- master

The branch name master holds the hash ID of commit C, so that we (and Git) can see which is the last commit. Commit C itself holds the hash ID of commit B, and commit B holds the hash ID of commit A, so that Git can work backwards from the last commit to the first. Commit A has no parent because it's the first, so this lets the follow-everything-backwards action stop.
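
To see the backwards chain for yourself, something like this works in a scratch repository. The file name and the commit messages A, B, and C are just stand-ins for the drawing above, and the identity is invented.

```shell
repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master      # make sure the branch is called master
git config user.name 'Example Author'
git config user.email 'author@example.com'

# Three commits in a row, each one recording its predecessor as its parent.
for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done

# master names C; each line shows a commit and the parent hash it stores.
# The last line (A) has an empty parent field, which is where the walk stops.
git log --format='%h %s  parent: %p' master
```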

To run git filter-branch you might use:

git filter-branch --tree-filter '<command to run>' -- master

The thing at the very end—master—is the branch name you want filter-branch to use when it lists all the commits it should operate on. That is, it will start at master and work backwards until it can go no further backwards. It will then copy each of these commits, applying the filter, and re-commit. When it's done, the one branch name it will update is master.

Using --all tells it to start with every branch, tag, and other reference. (This can misbehave on the stash ref, so --branches --tags is sometimes the better choice, but --all is traditional.) We'll come back to the --tag-name-filter option later too. For now let's just go with master.

The -- before master is to separate the part where you put the branch names from the rest of the options, some of which might conceivably resemble valid branch names. That's all it is: just boilerplate to mark "end of filter options, start of branch names".

Last, let's look at --tree-filter without looking at how to write a tree-filter. That just means: run the tree filter. So filter-branch will extract each commit, into a temporary directory that holds nothing but the committed files. This temporary directory does not have a .git subdirectory, and is not your work-tree. (It's actually a subdirectory of the -d directory you pass, or by default, a sub-directory of a temporary directory that filter-branch makes.) Your tree filter should:

  • apply whatever change you want
  • to every file in its current working directory
  • and recursively, to every file in every sub-directory of the current directory

If you wanted to, for instance, insert a header line in every file, you might use:

find . -type f -print | xargs <command to insert header line in every file>

You might put this command into a script, for ease of testing before use. If clang-format has the right options (which it probably does) you might not need a script at all, and can just specify:

--tree-filter 'clang-format <options>'

but either way, what filter-branch will do is use the shell's built-in eval to run the tree-filter. You must therefore make sure that your command consists of valid shell commands, and doesn't have a return or exit shell command in it (at least not without first spawning a subshell). If the command you're going to run is a script you've written, make sure that this script can be found via $PATH, or provide the full path name of the script:

--tree-filter "sh $HOME/scripts/filter-script.sh"

for instance.
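
As a rough idea of what such a script could look like, here is a sketch that writes one out. It assumes your sources end in .cc and .h and that an explicit -style=LLVM is what you want; older commits won't contain a .clang-format file unless you arrange for one. The CLANG_FORMAT variable is a hypothetical override I added so the script can be tried with a stand-in formatter, and the script location is arbitrary.

```shell
# Write a hypothetical filter script into a scratch directory.
script_dir=$(mktemp -d)
cat > "$script_dir/filter-script.sh" <<'EOF'
#!/bin/sh
# Tree filter sketch: reformat every .cc and .h file at or below the current
# directory, in place. filter-branch runs this inside the temporary directory
# that holds one commit's files.
# CLANG_FORMAT is a made-up override; by default plain clang-format is used.
find . -type f \( -name '*.cc' -o -name '*.h' \) \
    -exec "${CLANG_FORMAT:-clang-format}" -i -style=LLVM {} +
EOF

# Try the script on a scratch copy of the sources first. Then, in a clone:
#   git filter-branch --tree-filter "sh $script_dir/filter-script.sh" \
#       --tag-name-filter cat -- --all
echo "filter script written to $script_dir/filter-script.sh"
```

Running the formatter through find means a commit with no matching files simply does nothing, which is what you want for snapshots like a README-only first commit.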

Let's watch a simple filter in operation

Let's assume that commit A has one file in it, README.md. Let's assume commit B adds a new foo.cc file that will be reformatted, and that commit C modifies README.md without changing foo.cc at all. Your filter only changes any .cc and .h files, not the README.md. So, first, filter-branch itself enumerates all the commits, putting them in an appropriate order: A, then B, then C, in this case.

The tree-filter operation now:

  • extracts commit A;
  • runs your filter/script/command in the temporary directory holding the one file README.md;
  • makes a new commit from whatever your command leaves in the temporary directory.

Since your command doesn't touch README.md, the new commit is exactly, 100%, bit for bit identical to the original A. Filter-branch therefore re-uses the original commit A.

Now filter-branch moves to commit B. It extracts B's two files into the (now empty) temporary directory and runs your command. This time your command alters foo.cc, though it still leaves README.md alone. So now filter-branch makes a new commit with the modified foo.cc. The new commit re-uses the original commit's author name, email, and so on, so the metadata is preserved, but the snapshot has changed, so it gets a new and different hash ID, which we'll call B':

A--B--C   <-- [original master]
 \
  B'   [in progress]

Filter-branch now moves on to commit C. It extracts all of its files into the (re-emptied) temporary directory, so you have the same two files. Your filter now modifies foo.cc in the same way it did when operating on the contents of commit B. Filter-branch makes a new commit. The new commit's snapshot has the same README.md as in C and a modified foo.cc that matches the one in B', and it has a new parent, B', instead of B: this last part is what filter-branch handles for you. So now we have:

A--B--C   <-- [original master]
 \
  B'-C'   [in progress]

At this point, we've run out of commits to copy, so filter-branch does its last couple of tricks:

  • If there are tags that pointed to existing commits, and you specified a --tag-name-filter, Git makes new tags that point to the copies of those existing commits. Any tag that pointed to A can be left alone, but if a tag pointed to B, filter-branch copies it to a new tag that points to B'; if a tag pointed to C, filter-branch copies that to a new one that points to C'. The names of these new tags are from the --tag-name-filter: the old name goes into the filter, and what comes out is the new tag name.

    If you have no tags, this is all irrelevant.

  • Then, for each branch you named in the branch section of the command line, filter-branch stores the hash ID of the last copied commit into that branch. So here, filter-branch sets the name master to point to C'.

In case of any problem(s), filter-branch copies all the original branch and tag names to refs/original/: the old master becomes refs/original/refs/heads/master. If all has gone well, you eventually want to throw away the refs/original/ names.

The final drawing of the above would then be:

A--B--C   <-- refs/original/refs/heads/master
 \
  B'-C'   <-- master
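
The whole A, B, C walk-through can be reproduced in a scratch repository. This is only a sketch: a tr -s ' ' stand-in (which squeezes runs of spaces) replaces clang-format so it runs anywhere, and the file contents, identity, and variable names are invented.

```shell
# Newer Git versions print a warning and pause before starting unless this is set.
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d) && cd "$repo" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name 'Example Author'
git config user.email 'author@example.com'

# Commit A: README.md only.
echo 'readme v1' > README.md
git add README.md && git commit -q -m A
# Commit B: add a badly spaced foo.cc.
printf 'int  main()  {  return  0;  }\n' > foo.cc
git add foo.cc && git commit -q -m B
# Commit C: change README.md, leave foo.cc alone.
echo 'readme v2' > README.md
git commit -q -am C

old_a=$(git rev-parse master~2)
old_b=$(git rev-parse master~1)
old_c=$(git rev-parse master)

# Stand-in filter script: "formats" only the .cc files in the extracted commit.
filter=$(mktemp)
cat > "$filter" <<'EOF'
for f in *.cc; do
    [ -f "$f" ] || continue          # commit A has no .cc files at all
    tr -s ' ' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
EOF

git filter-branch --tree-filter "sh '$filter'" -- master >/dev/null

echo "A:  $old_a -> $(git rev-parse master~2)"    # re-used, same hash ID
echo "B': $old_b -> $(git rev-parse master~1)"    # copied, new hash ID
echo "C': $old_c -> $(git rev-parse master)"      # copied, new hash ID
echo "backup: $(git rev-parse refs/original/refs/heads/master)"
```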

As in Schwern's answer, you might want to be able to recover if everything goes horribly wrong. A way to do that is to run filter-branch on a copy (e.g., clone) of the repository, instead of on the original. Another way to do that is to note that you can always force all the updated refs back to the way they are as saved in refs/original/ (but that often requires a bit of programming).
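
The "bit of programming" for the second approach might look roughly like this. Both function names are my own invention, and the sketch assumes the refs/original/ names are still present and that you run it from inside the rewritten repository.

```shell
# Hypothetical helper: point every rewritten ref back at the commit that
# filter-branch saved for it under refs/original/.
restore_original_refs() {
    git for-each-ref --format='%(objectname) %(refname)' refs/original/ |
    while read -r oid backup; do
        # refs/original/refs/heads/master  ->  refs/heads/master
        git update-ref "${backup#refs/original/}" "$oid"
    done
}

# Hypothetical companion: once you are happy with the rewrite, drop the backups.
discard_original_refs() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r backup; do
        git update-ref -d "$backup"
    done
}
```

After restore_original_refs, the index and work-tree of the checked-out branch still hold the rewritten files, so you would probably follow it with git reset --hard to bring them back in line.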

Upvotes: 10

Schwern

Reputation: 164629

Before you get started rewriting history, I'd recommend tagging your current commit. This will allow you to return to your original version should something go horribly wrong. Or copy your whole repo, just in case.

We rewrite history in bulk with git-filter-branch. This is a bit of a nuclear Swiss army chainsaw. We'll use a --tree-filter to rewrite the directories ("tree") and files. --all says to do all referenced commits (i.e., all the branches and tags), not just the ones reachable from your current checkout. The -- in front of it separates filter-branch's own options from the revision options.

git filter-branch --tree-filter your_rewrite_command -- --all

This checks out each commit, runs your_rewrite_command, and rewrites the commit with the result.

I'd recommend writing a little shell script to do your rewriting and test it out before running git-filter-branch. Use git ls-files to get a list of all the files in the commit and run clang-format on each.
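
A minimal sketch of that idea, written as a function so it can be tried on a normal checkout first. The *.cc and *.h patterns and the -style=LLVM choice are assumptions about your project, and CLANG_FORMAT is a hypothetical override for swapping in a stand-in formatter while testing.

```shell
# Sketch of a rewrite helper: format each tracked C++ source file in place.
format_tracked_sources() {
    # git ls-files prints only files Git knows about, so untracked scratch
    # files are left alone. The patterns also match inside subdirectories.
    git ls-files -- '*.cc' '*.h' |
    while IFS= read -r file; do
        "${CLANG_FORMAT:-clang-format}" -i -style=LLVM "$file"
    done
}
```

Call format_tracked_sources in a clone, look over git diff, and only then wire the same commands into your_rewrite_command.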

Upvotes: 1
