How to disable "remap to ancestor" for tags in git-filter-branch?

Question

My problem is very similar to that question, and that answer works perfectly for me.

The only problem is with tags: I'm receiving a lot of unneeded tags in the resulting repo.

This is my command:

git filter-branch --tag-name-filter cat --prune-empty --index-filter "git rm --cached -qr --ignore-unmatch -- . && git reset -q $GIT_COMMIT -- path/to/dir1 path/to/dir2" -- --all

With the -- --all option, all tags are preserved; tags pointing to skipped commits are moved to the nearest ancestor commit.
Without the -- --all option, all tags are lost (unless listed explicitly on the command line, of course).

I want tags pointing to skipped commits to be automatically excluded instead of moved to the nearest ancestor commit.
All other tags should be preserved.
How can I do this?

P.S.
I'd like to avoid removing unneeded tags manually prior to running git filter-branch.
There are thousands of tags in the repo.

Update:
Thanks to @torek for confirmation that there is no straightforward way.
I solved my problem by running a Lua script that deletes all unneeded tags.

local list_of_dirs = "path/to/dir1 path/to/dir2" -- separated by space

local useful = {}
for line in io.popen(
   "git log --full-history --decorate=full --format=%D -- "..list_of_dirs
):lines() do
   for tag in line:gmatch"tag: refs/tags/([^,]+)" do
      -- Limitation: your tag names should not contain comma
      useful[tag] = true
   end
end
for tag in io.popen"git tag":lines() do
   if not useful[tag] then
      os.execute('git tag -d "'..tag..'"')
   end
end

torek · Accepted Answer

Unfortunately there is no way to ask git filter-branch to do this automatically. See below (far below :-) ) for one idea for modifying the code to make that possible; this may be easier and more reliable than the approach in the next section.

Fortunately, there is a way to automate discovery of tags remapped to their original commits, vs tags remapped to some other (hence ancestor) commit. Unfortunately I have never actually done this so the following is basically just theory, rather than practice.

The first step will be to build your own map. You will want, for each tag, to identify the final tagged object:

git for-each-ref --format '%(refname)' refs/tags |
    while read -r name; do echo "$name" "$(git rev-parse "$name^{}")"; done

This two-step method, instead of using %(object), seems to be needed to map each tag to its final object in case a tag points to another tag. The output of the above is a name-to-object map. You will need a map that corresponds to the "before" state, so run this before filtering (or on an "unfiltered" copy; see below).

You may want to limit yourself to tags that ultimately point to commits (see the alternative below about modifying filter-branch).

Once you have finished your filter-branch, use the same command to obtain a new map. (Redirect both commands' outputs to a temporary file.)
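
The two snapshots described above could be captured and lined up roughly as follows. This is only a sketch: the helper names tag_map and pair_maps, and the file names old.map and new.map, are hypothetical and not part of git.

```shell
# Print one "refname target-sha" line per tag; ^{} peels nested tags
# until a non-tag object is reached.
tag_map() {
    git for-each-ref --format='%(refname)' refs/tags |
        while read -r name; do
            printf '%s %s\n' "$name" "$(git rev-parse "$name^{}")"
        done
}

# Combine two saved maps on the tag name.
# Output: "refname old-sha new-sha", only for tags present in both files.
pair_maps() {
    awk 'NR == FNR { old[$1] = $2; next }
         $1 in old { print $1, old[$1], $2 }' "$1" "$2"
}
```

Run tag_map > old.map before filtering and tag_map > new.map afterwards; pair_maps old.map new.map then gives one line per tag to examine.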

If you prefer, you can do this just once, after filtering, by supplying a tag name filter that maps old tag names to unique, distinguishable new tag names. For instance if all your existing tags fit the pattern v<number>.<number> you could make your tag filter produce tags starting with w instead. That makes it easy to tell, in the post-filtered repo, which tag was which. Eventually you will have to rename all the tags back, of course.
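
As an illustration of such a tag name filter, assuming every tag really does start with v: filter-branch passes the old tag name on standard input and takes the new name from standard output, so a one-line sed is enough. The function names here are hypothetical.

```shell
# Hypothetical tag-name filter: reads the old tag name on stdin and
# prints the new name, with a leading "v" replaced by "w".
rename_tag() {
    sed 's/^v/w/'
}

# Reverse mapping, for renaming the surviving tags back afterwards.
restore_tag() {
    sed 's/^w/v/'
}
```

filter-branch evaluates the filter in its own shell, so a function defined in your interactive shell is not visible to it; pass the command inline instead, as in --tag-name-filter 'sed s/^v/w/'.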

Or, since you should be filtering a copy of the original repo, you can run the for-each-ref in the original repo for the "old" mapping and again in the filtered repo for the "new" mapping. Or, check into the refs/original/refs/tags/ name-space to find the original tags (I'm not sure if filter-branch saves the original tags like this, the way it saves the original branch name refs).

Your remaining task is the hard part: now we must figure out whether the new target object "is" the original target object (after filtering), or is some ancestor found via remap-to-ancestor. This is where we get theoretical, because what your filter-branch filter(s) is/are doing matters. How do we tell whether commit 89abcde "is" the filtered result of "1234567", or whether we simply skipped that commit? This, of course, depends on what your filters were.

Because filter-branch leaves all the original commits in the repository alongside their copies, with the original branch names stored in refs/original/refs/..., we can see all the original commits. This means we can run through the two maps and compare the commits, or re-run the filter(s), to make this kind of discovery.

If your filters always leave the tree intact, we might be able to use git cat-file -p <commit> | headergrep tree to extract the tree IDs. If the tree IDs of old and new commit match, we preserved that commit, so we wish to keep the tag; if not, we wish to discard the tag. (Note that you must write headergrep: it's simply a grep of the contents up to the first blank line, which separates the commit headers from the commit message.)
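
One possible headergrep, as a rough sketch (headergrep is not a real command, and tree_of is likewise a made-up helper name): stop reading at the first blank line, then grep what came before it.

```shell
# headergrep PATTERN: from a commit read on stdin, print only the header
# lines (everything before the first blank line) that match PATTERN.
headergrep() {
    awk '/^$/ { exit } { print }' | grep -e "$1"
}

# Tree ID of a commit, e.g. tree_of 1234567
tree_of() {
    git cat-file -p "$1" | headergrep '^tree ' | cut -d' ' -f2
}
```

For the tree ID alone, git rev-parse "<commit>^{tree}" gives the same answer without any helper; headergrep is more useful when you need other header lines as well.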

If your filters always leave everything but the tree intact, we might be able to extract everything except the tree and parent lines. This is iffier since an old commit that reads:

tree ...
parent ...
author A U Thor  1471018671 -0700
committer A U Thor  1471018671 -0700

terriblecommitmessage

may seem the same as a new, but remapped, commit that uses the exact same message, and is by the same author and committer and made within the same second so that the time stamps match (this could happen if some commits are made by automatic software that makes multiple commits per second).

In general, though, the contents of a copied commit will match (after discarding tree and parent lines), while the contents of a remapped commit won't. Hence we can hash and compare text, or compare raw text, using the equivalent of headergrep -v (which you must again write: it's a straightforward variant of our theoretical headergrep above, except that with -v we must copy the blank line and commit message, as well as all but the excluded header lines). The output can go to temporary files and cmp, or through git hash-object: we can just pretend these headergrep -v output lines are blobs and get their unique SHA-1 hash IDs, and compare those.
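
The headergrep -v variant could look something like the sketch below. Both function names are hypothetical, and like the rest of this section it is untested against a real filtered repository.

```shell
# Copy a commit from stdin, dropping only the "tree" and "parent" header
# lines; the blank separator and the whole message pass through untouched.
strip_tree_parent() {
    awk 'body { print; next }
         /^$/ { body = 1; print; next }
         !/^(tree|parent) / { print }'
}

# Hash of a commit's remaining text, treating that text as if it were a blob.
commit_fingerprint() {
    git cat-file -p "$1" | strip_tree_parent | git hash-object --stdin
}
```

If commit_fingerprint prints the same ID for a tag's old target and its new target, the commit was most likely copied rather than remapped, subject to the same-second caveat above.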

Of course, if your filter does something very easily identifiable, such as skipping commits with some particular author (as in one of the documentation examples), it will be easy to tell which commits were skipped and therefore caused a remap-to-ancestor.

Once we know which commits were preserved, and which were remapped, we know which tags to keep (preserved) or discard (remapped). Now it's merely a matter of deleting all the "discard" tags.
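
Assuming you have collected the comparison results into lines of the form "refname old-fingerprint new-fingerprint" (a made-up format for this sketch, where the fingerprint is whatever you chose to compare: tree ID, hash of the stripped commit, and so on), the final step might look like this. The function names are hypothetical.

```shell
# Read "refname old-fingerprint new-fingerprint" lines on stdin and print
# the short names of tags whose fingerprints differ (the "discard" set).
discard_list() {
    awk '$2 != $3 { sub("^refs/tags/", "", $1); print $1 }'
}

# Delete every tag named on stdin, one name per line.
delete_tags() {
    while read -r tag; do
        git tag -d "$tag"
    done
}
```

Inspect the output of discard_list on its own first, then pipe it into delete_tags once the list looks right.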

One other possibility would be to copy the filter-branch script:

$ less $(git --exec-path)/git-filter-branch
#!/bin/sh
#
# Rewrite revision history
# Copyright (c) Petr Baudis, 2006
... [snip]

Note that the tag name filter is run after the remap_to_ancestor code handles branch names that point to commits that were discarded, and hence remapped (creating "$workdir"/../map/$sha1). If you move it to run before that point you can easily tell which commits were skipped. In fact, the code to remap that tag does nothing at all if the tag's target commit is not in the map, or the tag's target is not a commit. (You would want to delete it in this case. It's not at all clear what you would want to do with tags that point to trees or blobs.)
