sizif

Reputation: 344

Matching a tarball to a git repository

Given: a git repository and a tarball with no revision information. The tree in the tarball was copied from the repository at some point in the past and has changed quite a bit since. The repository has changed quite a bit as well, and the commit the tarball tree was copied from is unknown. The task is to find the commit closest to the tarball, in order to examine the changes made in the tarball tree or to graft the tarball tree back into the repository.

I have done this before with a manual binary search, minimizing the output of diff -ruN gitrepo tartree | wc -c. Is there a tool that can automate the task?

Upvotes: 2

Views: 252

Answers (2)

sizif

Reputation: 344

Thank you, fredrik and Ôrel, for the comments. I understand that the original commit may or may not be discoverable, which is why I asked for the closest commit. I have coded a linear brute-force search, and it finds a clear minimum much faster than the manual search I did before, especially if you guess well which commit to start the search from.

(Update: the script was shortened by using git log --pretty=format, as suggested by LeGEC.)

#!/usr/bin/perl

use strict;
use warnings;

# Estimate the similarity of DIR to every commit in the "git log" output
# and print one line per commit.  "git log" starts from the currently
# checked out commit and goes back in time.
#
# The script is quick and dirty: it checks out every commit in turn to
# take a diff.  After the script stops for whatever reason, the last
# commit seen stays checked out.  You will have to restore the original
# checkout yourself.

sub usage {
    die ("Usage:\n",
         "  cd clean-git-repo\n",
         "  git-match-dir DIR\n");
}

sub main {
    my $dir = $ARGV[0] // usage();
    open (my $fh, "git log --pretty='%H %ad'|") or die "cannot run git log: $!";
    while (<$fh>) {
        # d2e9457319bff7326d5162b47dd4891c652c2089 Thu Sep 14 09:44:58 2017 +0300
        my ($commit, $date) = /(\w+) \w\w\w (.*)/;
        $commit or die "unexpected output from git log: $_";
        my $out = `git checkout $commit 2>&1`;
        $? == 0 or die "$out\nCheckout error.  Stop";
        # \Q...\E backslash-escapes $dir so paths with spaces survive the shell.
        my $len = 0 + `diff -wruN --exclude .git . \Q$dir\E | wc -c`;
        printf("%10u %s %s\n", $len, $commit, $date);
    }
}

main();
exit 0;
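
For comparison, here is a rough shell sketch of a variant that leaves the checkout alone. It is not the script above: it swaps in a different technique, staging the tarball directory into a scratch index, writing it as a tree object, and scoring each commit with git diff --numstat (changed lines) instead of the byte size of diff -wruN. The function name rank_commits_by_diff and its two arguments (repository path, unpacked tarball directory) are made up for illustration, and the sketch assumes a git recent enough to support rev-parse --absolute-git-dir.

```shell
# Score every commit reachable from HEAD in repository $1 against directory $2.
# Prints "<changed-lines> <commit>" lines, closest commit first.
# Hypothetical helper: the name and argument order are illustrative only.
rank_commits_by_diff() (
    gitdir=$(git -C "$1" rev-parse --absolute-git-dir) || exit 1
    cd "$2" || exit 1
    tmp=$(mktemp -d) || exit 1
    # Stage the directory into a scratch index so the real index stays untouched.
    # The blobs end up in the repository as unreferenced objects.
    export GIT_DIR="$gitdir" GIT_INDEX_FILE="$tmp/index"
    GIT_WORK_TREE="$PWD" git add -A . &&
        tree=$(git write-tree)
    status=$?
    rm -rf "$tmp"
    [ "$status" -eq 0 ] || exit 1
    unset GIT_INDEX_FILE
    git rev-list HEAD | while read -r commit; do
        # --numstat prints "added<TAB>deleted<TAB>path" for each changed file;
        # binary files show "-" in both columns and count as 0 here.
        git diff --numstat "$commit" "$tree" |
            awk -v c="$commit" '{ n += $1 + $2 } END { print n + 0, c }'
    done | sort -n
)
```

Called as rank_commits_by_diff path/to/repo path/to/tartree, the first output line names the commit with the fewest changed lines. The metric differs from the Perl script (no -w, line counts rather than bytes), so the two may rank near-ties differently.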

Upvotes: 2

LeGEC

Reputation: 51850

If the tarball is the exact content of one of your repo's commits, you can search for the tree hash:

  1. use git to compute the tree hash of your tarball
  2. print the list of commit hashes with their tree hashes
  3. grep the hash from step 1 in the list from step 2

In detail:

  1. In an empty directory:
     • create a repo (git init)
     • untar your tarball
     • run git add -A && git commit -m tarball
     • run git rev-parse 'HEAD^{tree}'
  2. In your git repo, run:

    git log --all --pretty=format:"%H %T"

  3. grep the output of step 1 in the list produced by step 2.
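
The steps above can be sketched as a pair of shell functions. This is a minimal sketch under a few assumptions: the tarball is already unpacked into a directory, the function names tree_hash_of_dir and commits_matching_dir are made up for illustration, and git write-tree is used in place of git commit plus git rev-parse so that no commit identity needs to be configured. A throwaway bare repository serves as the object store, so the unpacked directory itself is not modified.

```shell
# Print the tree hash git would assign to the contents of directory $1.
# Hypothetical helper; uses a temporary bare repository as the object store.
tree_hash_of_dir() (
    store=$(mktemp -d) || exit 1
    git init -q --bare "$store" &&
        cd "$1" &&
        GIT_DIR="$store" GIT_WORK_TREE="$PWD" git add -A . &&
        GIT_DIR="$store" git write-tree
    status=$?
    rm -rf "$store"
    exit "$status"
)

# Print every commit in repository $1 whose tree is identical to directory $2.
commits_matching_dir() {
    tree=$(tree_hash_of_dir "$2") || return 1
    # %H is the commit hash, %T the hash of that commit's top-level tree.
    git -C "$1" log --all --pretty=format:'%H %T' |
        awk -v t="$tree" '$2 == t { print $1 }'
}
```

Running commits_matching_dir path/to/repo path/to/tartree prints nothing when no commit matches exactly, which is the expected result once the tarball tree has diverged from the repository.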

Upvotes: 0
