Reputation: 75
I plan to use LibGit2/LibGit2Sharp, and hence Git, in an unorthodox manner, and I am asking anyone familiar with the API to confirm that what I propose will work in theory. :)
Scenario
Only the master branch will exist in the repository. A large number of directories containing large binary and non-binary files will be tracked and committed. Most of the binary files will change between commits. The repository should contain no more than 10 commits due to disk space limitations (the disk fills up quite often now).
What the API does not provide is a function that truncates the commit history from a specified CommitId back to the initial commit of the master branch and deletes any Git objects that would be left dangling as a result.
I have tested the ReferenceCollection.RewriteHistory method, and I can use it to remove the parents from a commit. This gives me a new commit history that runs from HEAD back to CommitId. But that still leaves all of the old commits and any references or blobs that are unique to those commits. My plan right now is to simply clean up these dangling Git objects myself. Does anyone see any problems with this approach or have a better one?
Upvotes: 3
Views: 1128
Reputation: 75
Per the suggestions in the other answer, here is the preliminary (still testing) C# code that I came up with. It truncates the master branch at a specific SHA, creating a new initial commit, and it also removes all dangling references and blobs.
using System;
using System.Collections.Generic;
using System.IO;
using LibGit2Sharp;

public class RepositoryUtility
{
    public RepositoryUtility()
    {
    }
    public String[] GetPaths(Commit commit)
    {
        List<String> paths = new List<string>();
        RecursivelyGetPaths(paths, commit.Tree);
        return paths.ToArray();
    }
    private void RecursivelyGetPaths(List<String> paths, Tree tree)
    {
        foreach (TreeEntry te in tree)
        {
            paths.Add(te.Path);
            if (te.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyGetPaths(paths, te.Target as Tree);
            }
        }
    }
    public void TruncateCommits(String repositoryPath, Int32 maximumCommitCount)
    {
        //dispose the repository when done so that its file handles are released
        using (IRepository repository = new Repository(repositoryPath))
        {
            Int32 count = 0;
            string newInitialCommitSHA = null;
            foreach (Commit masterCommit in repository.Head.Commits)
            {
                count++;
                if (count == maximumCommitCount)
                {
                    newInitialCommitSHA = masterCommit.Sha;
                }
            }
            //there must be parent commits to the commit we want to set as the new initial commit
            if (count > maximumCommitCount)
            {
                TruncateCommits(repository, repositoryPath, newInitialCommitSHA);
            }
        }
    }
    private void RecursivelyCheckTreeItems(Tree tree, Dictionary<String, TreeEntry> treeItems, Dictionary<String, GitObject> gitObjectDeleteList)
    {
        foreach (TreeEntry treeEntry in tree)
        {
            //if the blob or tree is not referenced by any commit that is being kept then add it to the deletion list
            if (!treeItems.ContainsKey(treeEntry.Target.Sha))
            {
                if (!gitObjectDeleteList.ContainsKey(treeEntry.Target.Sha))
                {
                    gitObjectDeleteList.Add(treeEntry.Target.Sha, treeEntry.Target);
                }
            }
            if (treeEntry.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyCheckTreeItems(treeEntry.Target as Tree, treeItems, gitObjectDeleteList);
            }
        }
    }
    private void RecursivelyAddTreeItems(Dictionary<String, TreeEntry> treeItems, Tree tree)
    {
        foreach (TreeEntry treeEntry in tree)
        {
            //check for existence because if a file is renamed it can exist under a tree multiple times with the same SHA
            if (!treeItems.ContainsKey(treeEntry.Target.Sha))
            {
                treeItems.Add(treeEntry.Target.Sha, treeEntry);
            }
            if (treeEntry.TargetType == TreeEntryTargetType.Tree)
            {
                RecursivelyAddTreeItems(treeItems, treeEntry.Target as Tree);
            }
        }
    }
    private void TruncateCommits(IRepository repository, String repositoryPath, string newInitialCommitSHA)
    {
        //tree entries (keyed by target SHA) that are referenced by the commits being kept
        Dictionary<String, TreeEntry> treeItems = new Dictionary<string, TreeEntry>();
        Commit selectedCommit = null;
        //objects (keyed by SHA) that are only referenced by the commits being removed
        Dictionary<String, GitObject> gitObjectDeleteList = new Dictionary<String, GitObject>();
        //loop thru the commits starting at the head moving towards the initial commit
        foreach (Commit masterCommit in repository.Head.Commits)
        {
            //if non null then we have already found the commit where we want the truncation to occur
            if (selectedCommit != null)
            {
                //this commit is older than the truncation point so add it to our deletion list
                gitObjectDeleteList.Add(masterCommit.Sha, masterCommit);
                //check the trees and blobs of this commit to see if they should be deleted
                RecursivelyCheckTreeItems(masterCommit.Tree, treeItems, gitObjectDeleteList);
            }
            else
            {
                //have we found the commit that we want to be the initial commit
                if (String.Equals(masterCommit.Sha, newInitialCommitSHA, StringComparison.OrdinalIgnoreCase))
                {
                    selectedCommit = masterCommit;
                }
                //this commit is the new initial commit or newer so record the tree entries that need to be kept
                RecursivelyAddTreeItems(treeItems, masterCommit.Tree);
            }
        }
        //this function simply clears out the parents of the new initial commit
        Func<Commit, IEnumerable<Commit>> rewriter = (c) => { return new Commit[0]; };
        //perform the rewrite
        repository.Refs.RewriteHistory(new RewriteHistoryOptions() { CommitParentsRewriter = rewriter }, selectedCommit);
        //clean up the backup references now in refs/original and remove the objects that they point to
        foreach (var reference in repository.Refs.FromGlob("refs/original/*"))
        {
            repository.Refs.Remove(reference);
            //skip the branch reference on file deletion
            if (reference.CanonicalName.IndexOf("master", 0, StringComparison.OrdinalIgnoreCase) == -1)
            {
                //delete the object file from the file system
                DeleteGitBlob(repositoryPath, reference.TargetIdentifier);
            }
        }
        //now remove any tags that reference commits that are going to be deleted in the next step
        foreach (var reference in repository.Refs.FromGlob("refs/tags/*"))
        {
            if (gitObjectDeleteList.ContainsKey(reference.TargetIdentifier))
            {
                repository.Refs.Remove(reference);
            }
        }
        //remove the commits, trees and blobs from the Git object database
        foreach (KeyValuePair<String, GitObject> kvp in gitObjectDeleteList)
        {
            //delete the object file from the file system
            DeleteGitBlob(repositoryPath, kvp.Value.Sha);
        }
    }
    private void DeleteGitBlob(String repositoryPath, String blobSHA)
    {
        //loose objects live in .git/objects/<first 2 characters of the SHA>/<remaining characters>
        String shaDirName = System.IO.Path.Combine(System.IO.Path.Combine(repositoryPath, ".git", "objects"), blobSHA.Substring(0, 2));
        String shaFileName = System.IO.Path.Combine(shaDirName, blobSHA.Substring(2));
        //if the directory exists
        if (System.IO.Directory.Exists(shaDirName))
        {
            //get the files in the directory
            String[] directoryFiles = System.IO.Directory.GetFiles(shaDirName);
            foreach (String directoryFile in directoryFiles)
            {
                //if we found the file to delete
                if (String.Equals(shaFileName, directoryFile, StringComparison.OrdinalIgnoreCase))
                {
                    //if readonly set the file to RW
                    FileInfo fi = new FileInfo(shaFileName);
                    if (fi.IsReadOnly)
                    {
                        fi.IsReadOnly = false;
                    }
                    //delete the file
                    File.Delete(shaFileName);
                    //eliminate the directory if only one file existed
                    if (directoryFiles.Length == 1)
                    {
                        System.IO.Directory.Delete(shaDirName);
                    }
                }
            }
        }
    }
}
Thanks for all of your help. It is sincerely appreciated. Please note I edited this code from the original because it did not take into account directories.
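For readers who only need the file-system part of the code above, the following is a minimal sketch of locating and deleting a single loose object file. The class and method names (LooseObjectFiles, GetLooseObjectPath, TryDeleteLooseObject) are hypothetical and not part of LibGit2Sharp. It assumes 40-character SHA-1 object ids and only handles loose objects; an object stored in a pack file has no file of its own, so the method simply reports that nothing was deleted.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of LibGit2Sharp.
public static class LooseObjectFiles
{
    // Builds <gitDirectory>/objects/<first 2 hex chars>/<remaining 38 hex chars>.
    public static string GetLooseObjectPath(string gitDirectory, string sha)
    {
        if (sha == null || sha.Length != 40)
        {
            throw new ArgumentException("Expected a 40-character SHA-1.", "sha");
        }
        // Loose object file names are lowercase hex.
        sha = sha.ToLowerInvariant();
        return Path.Combine(gitDirectory, "objects", sha.Substring(0, 2), sha.Substring(2));
    }

    // Returns true if a loose object file was found and deleted,
    // false if no such file exists (for example, the object is packed).
    public static bool TryDeleteLooseObject(string gitDirectory, string sha)
    {
        string path = GetLooseObjectPath(gitDirectory, sha);
        if (!File.Exists(path))
        {
            return false;
        }
        // Object files are usually read-only, so clear the attribute first.
        File.SetAttributes(path, FileAttributes.Normal);
        File.Delete(path);
        return true;
    }
}
```

Using Path.Combine for every segment avoids hard-coding a Windows path separator, and checking the return value makes it visible when an object could not be removed this way.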
Upvotes: 0
Reputation: 67659
But that still leaves all of the old commits and any references or blobs that are unique to those commits. My plan right now is to simply clean up these dangling GIT objects myself.
While rewriting the history of the repository, LibGit2Sharp takes care not to discard the rewritten references: it keeps a backup of each one. The namespace under which they are stored is, by default, refs/original. This can be changed through the RewriteHistoryOptions parameter.
In order to remove old commits, trees and blobs, one would first have to remove those references. This can be achieved with the following code:
foreach (var reference in repo.Refs.FromGlob("refs/original/*"))
{
    repo.Refs.Remove(reference);
}
The next step would be to purge the now-dangling Git objects. However, this cannot be done through LibGit2Sharp (yet). One option would be to shell out to git with the following command:
git gc --aggressive
This will reduce the size of your repository in a very effective, destructive and non-recoverable way.
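A minimal sketch of shelling out from C# follows. The GitMaintenance class and its method names are hypothetical, and the code assumes a git executable is available on the PATH. Note that, by default, git gc only prunes unreachable loose objects older than its grace period (two weeks) and keeps anything the reflog still points to, so this sketch expires the reflog first and passes --prune=now; whether that is appropriate depends on how much of a safety net you want to keep.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; assumes "git" can be found on the PATH.
public static class GitMaintenance
{
    // Argument strings for the git invocations, in the order they must run.
    public static string[] PruneCommands()
    {
        return new[]
        {
            "reflog expire --expire=now --all", // drop reflog entries that keep old commits reachable
            "gc --prune=now --aggressive"       // repack and delete unreachable objects immediately
        };
    }

    // Runs "git <arguments>" inside repositoryPath and returns the exit code.
    public static int RunGit(string repositoryPath, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "git",
            Arguments = arguments,
            WorkingDirectory = repositoryPath,
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }

    // Runs every prune command and stops at the first failure.
    public static void PruneUnreachableObjects(string repositoryPath)
    {
        foreach (string arguments in PruneCommands())
        {
            int exitCode = RunGit(repositoryPath, arguments);
            if (exitCode != 0)
            {
                throw new InvalidOperationException(
                    "git " + arguments + " exited with code " + exitCode);
            }
        }
    }
}
```

Calling GitMaintenance.PruneUnreachableObjects(repositoryPath) after the refs/original references have been removed would then run both commands in sequence. Dispose the LibGit2Sharp Repository instance first, so that git does not contend with open file handles.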
Does anyone see any problems with this approach or have a better one?
Your approach looks valid.
Does anyone see any problems with this approach or have a better one?
If the limit is the disk size, another option would be to use a tool like git-annex or git-bin to store large binary files outside of the git repository. See this SO question to get some different views on the subject and potential drawbacks (deployment, lock-in, ...).
I will try the RewriteHistoryOptions and foreach code that you provided. However, for now it looks like File.Delete on dangling git objects for me.
Beware, this may be a bumpy road:
Files in the .git\objects folder are usually read-only. File.Delete can't remove them in this state. You'd have to unset the read-only attribute first, for instance with a call to File.SetAttributes(path, FileAttributes.Normal);
Determining which Trees and Blobs are dangling may turn into quite a complex task.
Upvotes: 3