Sahil Singh

Reputation: 3787

How to manage a large git repository on a slow hard drive?

A project that I am working on has grown organically, and the size, number of files, and variety of file types in the repo have grown far too large. I have looked into several git optimizations, but none of them quite fits my situation. Here is what I want.

  1. Manually track files: when I edit a file, I will run git add <file-name> myself. Git's assume-unchanged won't help, since I would have to run --no-assume-unchanged before every add.

  2. Git commit should only add the files I have staged in the index and not worry about any other file. I have seen git taking too much time even after using core.ignoreStat.

  3. A sparse checkout should not download the entire repository first (it is a very big repository, even if I use --depth 1). (However, it may not be possible with git)

  4. Although my repository has a lot of directories, I only work in a small set of them for a while, and then in another set later. I rarely need all the directories at the same time. It would be good if there were a command, say git hide <directory>, which hides the directory in the working tree and relieves git from tracking it until I need it again. I am already using core.ignoreStat, status.showUntrackedFiles, and commit.status. Here is my git config.

    user.email=xxx@xxx
    core.repositoryformatversion=0
    core.filemode=true
    core.bare=false
    core.logallrefupdates=true
    core.ignorestat=true
    core.showuntrackedfiles=no
    remote.git_ch.url=file:////home/xxx/git_server/linux-namespaces.git
    remote.git_ch.fetch=+refs/heads/*:refs/remotes/git_ch/*
    branch.master.remote=git_ch
    branch.master.merge=refs/heads/master
    status.showuntrackedfiles=no
    commit.status=false
    

The repository is still too slow.

Additionally, can you suggest the possible reason for it being so slow, out of these?

There are several git extensions like git annex, Google's git repo, etc. Will using any of these be of help, or will it be better to switch to another VCS?

I am using Ubuntu Gnome 16.04.1.

Upvotes: 1

Views: 1280

Answers (2)

VonC

Reputation: 1323753

Note: Git 2.13 might help with repositories that have a very large index:

See commit b460139, commit b2dd1c5, commit c3a0082, commit de6ae5f, commit c0441f7, commit b968372 (06 Mar 2017), and commit 77d6797, commit 0d59ffb, commit 6a5e6f5, commit e77cf4e, commit fcdbd95, commit e6a1dd7, commit 72dcb7b, commit 13c0e4c, commit 66f9e7a, commit b8923bf, commit 6cc1053, commit 4392531, commit cef4fc7, commit 1f44b09 (27 Feb 2017) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit 94c9b5a, 17 Mar 2017)

The revised git update-index documentation now includes a Split Index section:

Split index

This mode is designed for repositories with very large indexes, and aims at reducing the time it takes to repeatedly write these indexes.

In this mode, the index is split into two files,

  • $GIT_DIR/index
  • $GIT_DIR/sharedindex.<SHA-1>.

Changes are accumulated in $GIT_DIR/index, the split index, while the shared index file contains all index entries and stays unchanged.

All changes in the split index are pushed back to the shared index file when the number of entries in the split index reaches a level specified by the splitIndex.maxPercentChange config variable.

Each time a new shared index file is created, the old shared index files are deleted if their modification time is older than what is specified by the splitIndex.sharedIndexExpire config variable.

To avoid deleting a shared index file that is still used, its modification time is updated to the current time every time a new split index based on the shared index file is either created or read from.
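As a minimal sketch of how this mode can be switched on, the commands below enable split index in a throwaway repository. The temporary directory and the file name are placeholders for illustration only; in a real project you would run just the `git config` and `git update-index` lines inside the existing working tree.

```shell
# Create a scratch repository so the example does not touch real data.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Stage one file so the index has an entry (file name is a placeholder).
echo hello > file.txt
git add file.txt

# Ask Git to keep using split-index mode for this repository...
git config core.splitIndex true
# ...and convert the current index right away.
git update-index --split-index

# The shared index should now sit next to the (small) split index.
ls .git/index .git/sharedindex.*
```

Running `git update-index --no-split-index` (and unsetting `core.splitIndex`) should revert to a single index file.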

So you now (Git 2.13, Q2 2017) have the configurations:

splitIndex.maxPercentChange:

When the split index feature is used, this specifies the percent of entries the split index can contain compared to the total number of entries in both the split index and the shared index before a new shared index is written.

The value should be between 0 and 100.
If the value is 0 then a new shared index is always written,
if it is 100 a new shared index is never written.

By default the value is 20, so a new shared index is written if the number of entries in the split index would be greater than 20 percent of the total number of entries.

And:

splitIndex.sharedIndexExpire:

When the split index feature is used, shared index files that were not modified since the time this variable specifies will be removed when a new shared index file is created.

The value "now" expires all entries immediately, and "never" suppresses expiration altogether.

The default value is "2.weeks.ago".

Note that a shared index file is considered modified (for the purpose of expiration) each time a new split-index file is either created based on it or read from it.
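The two settings quoted above can be set per repository with `git config`. This is only a sketch: the values `10` and `1.week.ago` are arbitrary illustrations, not recommendations, and the scratch repository stands in for your own.

```shell
# Scratch repository used only for demonstration.
repo=$(mktemp -d)
git init -q "$repo"

# Rewrite the shared index once the split index holds more than 10% of all
# entries (the documented default is 20).
git -C "$repo" config splitIndex.maxPercentChange 10

# Remove shared index files not touched for a week (default: 2.weeks.ago).
git -C "$repo" config splitIndex.sharedIndexExpire "1.week.ago"

# Show what was stored.
git -C "$repo" config --get-regexp '^splitindex\.'
```

A lower `maxPercentChange` means the large shared index is rewritten more often, so on a slow disk a higher value may be worth trying.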


With Git 2.27 (Q2 2020), the code that refreshes the last access and modified time of on-disk packfiles and loose object files has been updated.

See commit 312cd76 (14 Apr 2020) by Luciano Miguel Ferreira Rocha.
(Merged by Junio C Hamano -- gitster -- in commit 51a68dd, 28 Apr 2020)

freshen_file(): use NULL times for implicit current-time

Signed-off-by: Luciano Miguel Ferreira Rocha

Update freshen_file() to use a NULL times, semantically equivalent to the currently setup, with an explicit actime and modtime set to the "current time", but with the advantage that it works with other files not owned by the current user.

Fixes an issue on shared repos with a split index, where eventually a user's operation creates a shared index, and another user will later do an operation that will try to update its freshness, but will instead raise a warning:

    $ git status
    warning: could not freshen shared index '.git/sharedindex.bd736fa10e0519593fefdb2aec253534470865b2'

Upvotes: 3

Olivier Duhart

Reputation: 437

Adding an SSD will definitely do the job. I am facing the exact same problem. A colleague has a computer with an SSD, and it is far quicker at every single git action. I tried everything you listed, but the real problem is low IO performance. Git uses a lot of tiny files (look at your .git directory) to manage the different versions of files, so poor IO latency slows everything down.
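To see how many of those small files a repository currently holds, `git count-objects -v` reports the number of loose objects, and `git gc` repacks objects into pack files. The sketch below uses a scratch repository with a placeholder file; whether repacking helps noticeably on a slow disk is an assumption you would need to measure in your own repository.

```shell
# Scratch repository used only for demonstration.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Staging a file writes one loose object under .git/objects.
echo data > a.txt
git add a.txt

# "count" is the number of loose objects, "in-pack" the number already packed.
git count-objects -v
loose_before=$(git count-objects -v | sed -n 's/^count: //p')

# Repack objects into pack files to cut down on tiny files.
git gc -q
git count-objects -v
```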

Upvotes: 0
