Reputation: 11224
I have a 10 GB repo on a Linux machine which is on NFS. The first git status takes 36 minutes and subsequent git status runs take 8 minutes, so Git seems to depend on the OS for caching files. Only the first git commands like commit and status that have to walk or pack/repack the whole repo take a very long time on a huge repo. I am not sure if you have used git status on such a large repo, but has anyone come across this issue?
I have tried git gc, git clean and git repack, but the time taken is still almost the same.
Would submodules, or any other approach like breaking the repo into smaller ones, help? If so, which is the best way to split a large repo? Is there any other way to improve the time taken by git commands on a large repo?
Upvotes: 104
Views: 93342
Reputation: 674
In our codebase, where we have somewhere in the range of 20 to 30 submodules, git status --ignore-submodules sped things up drastically for me. Do note that this will not report on the status of submodules.
See the --ignore-submodules docs for the additional options: "none", "untracked", "dirty" or "all". Using --ignore-submodules=dirty can be a good compromise: it skips checking the submodule working-tree files and only reports whether the recorded commit has changed.
To make this the default for all future commands: git config diff.ignoreSubmodules dirty
(Thanks @d2207197)
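As a quick sketch of both forms (nothing beyond what is stated above, just collected in one place):
git status --ignore-submodules=dirty    # one-off: skip submodule worktree checks
git config diff.ignoreSubmodules dirty  # repo-wide default for future commands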
Upvotes: 7
Reputation: 1328112
With Git 2.40 (Q1 2023), the advice message given by "git status"(man) when it takes a long time to enumerate untracked paths has been updated. It better illustrates all the configuration settings you can apply to get a snappier/faster git status.
See commit ecbc23e (30 Nov 2022) by Rudy Rigot (rudyrigot).
(Merged by Junio C Hamano -- gitster -- in commit f3d9bc8, 19 Dec 2022)
status: modernize git-status "slow untracked files" advice
Signed-off-by: Rudy Rigot
git status(man) can be slow when there are a large number of untracked files and directories, since Git must search the entire worktree to enumerate them. When it is too slow, Git prints advice with the elapsed search time and a suggestion to disable the search using the -uno option. This suggestion also carries a warning that might scare off some users.
However, these days, -uno isn't the only option. Git can reduce the time taken to enumerate untracked files by caching results from previous git status invocations, when the core.untrackedCache and core.fsmonitor features are enabled.
Update the git status man page to explain these configuration options, and update the advice to provide more detail about the current configuration and to refer to the updated documentation.
git status now includes in its man page:
UNTRACKED FILES AND PERFORMANCE
git status can be very slow in large worktrees if/when it needs to search for untracked files and directories. There are many configuration options available to speed this up by either avoiding the work or making use of cached results from previous Git commands. There is no single optimum set of settings right for everyone.
We'll list a summary of the relevant options to help you, but before going into the list, you may want to run git status again, because your configuration may already be caching git status results, so it could be faster on subsequent runs.
The --untracked-files=no flag or the status.showUntrackedFiles=no config (see above for both): indicate that git status should not report untracked files. This is the fastest option. git status will not list the untracked files, so you need to be careful to remember if you create any new files and manually git add them.
advice.statusUoption=false (see git config): setting this variable to false disables the warning message given when enumerating untracked files takes more than 2 seconds. In a large project, it may take longer, and the user may have already accepted the trade-off (e.g. using "-uno" may not be an acceptable option for them), in which case there is no point issuing the warning message, and disabling it may be best.
core.untrackedCache=true (see git update-index): enable the untracked cache feature and only search directories that have been modified since the previous git status command. Git remembers the set of untracked files within each directory and assumes that if a directory has not been modified, then the set of untracked files within it has not changed.
This is much faster than enumerating the contents of every directory, but still not without cost, because Git still has to search for the set of modified directories. The untracked cache is stored in the .git/index file. The reduced cost of searching for untracked files is offset slightly by the increased size of the index and the cost of keeping it up-to-date. That reduced search time is usually worth the additional size.
core.untrackedCache=true and core.fsmonitor=true or core.fsmonitor=<hook_command_pathname> (see git update-index): enable both the untracked cache and FSMonitor features and only search directories that have been modified since the previous git status command. This is faster than using just the untracked cache alone because Git can also avoid searching for modified directories. Git only has to enumerate the exact set of directories that have changed recently. While the FSMonitor feature can be enabled without the untracked cache, the benefits are greatly reduced in that case.
Note that after you turn on the untracked cache and/or FSMonitor features it may take a few git status commands for the various caches to warm up before you see improved command times. This is normal.
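If you want to try those two settings on a repository, a minimal sketch (assuming a Git recent enough to ship the built-in FSMonitor daemon; on older versions core.fsmonitor expects a hook pathname instead of a boolean):
git config core.untrackedCache true   # cache untracked-file results in the index
git config core.fsmonitor true        # built-in filesystem monitor (recent Git only)
git status                            # first run warms the caches
git status                            # later runs should be noticeably faster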
Upvotes: 5
Reputation: 177775
To be more precise, git depends on the efficiency of the lstat(2) system call, so tweaking your client’s “attribute cache timeout” might do the trick.
The manual for git-update-index — essentially a manual mode for git-status — describes what you can do to alleviate this, by using the --assume-unchanged flag to suppress its normal behavior and manually update the paths that you have changed. You might even program your editor to unset this flag every time you save a file.
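For example, marking a rarely-changing path so git status skips it (the path is just a placeholder):
git update-index --assume-unchanged third_party/huge_generated_file
# when you do edit it, clear the flag so git sees the change again:
git update-index --no-assume-unchanged third_party/huge_generated_file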
The alternative, as you suggest, is to reduce the size of your checkout (the size of the packfiles doesn’t really come into play here). The options are a sparse checkout, submodules, or Google’s repo tool.
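On a modern Git, the sparse checkout just mentioned can be set up roughly like this (directory names are hypothetical; Git 2.25+ has the sparse-checkout command, older versions need the manual core.sparseCheckout setup):
git sparse-checkout init --cone      # restrict the worktree
git sparse-checkout set dir1 dir2    # only materialize the directories you work on
git sparse-checkout list             # show what is currently included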
(There’s a mailing list thread about using Git with NFS, but it doesn’t answer many questions.)
Upvotes: 53
Reputation: 80382
As a test, try temporarily disabling realtime protection in your antivirus software. If that turns out to be the issue, consider swapping your antivirus.
Case in point: I had Webroot running, and it was taking 30 to 60 seconds to do anything with Git. Paused the realtime protection, and suddenly my original performance was back, with sub-second updates and a fast, snappy system.
I chose Webroot as it is famed for minimal impact on system performance, but in this case it was pouring metaphorical molasses into my CPU.
Upvotes: 0
Reputation: 9493
A frequent cause of slowness for big repos is the status command's ahead/behind check against the remote branch. Set this repo-level configuration to disable it:
git config status.aheadBehind false
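If you only want to skip the check occasionally rather than change the default, the same thing exists as a per-invocation flag (also mentioned further down in this thread):
git status --no-ahead-behind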
Upvotes: 1
Reputation: 1328112
The performance of git status should improve with Git 2.13 (Q2 2017).
See commit 950a234 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 8b6bba6, 24 Apr 2017)
string-list: use ALLOC_GROW macro when reallocing string_list
Use the ALLOC_GROW() macro when reallocing a string_list array rather than simply increasing it by 32. This is a performance optimization.
During status on a very large repo with many changes, a significant percentage of the total run time is spent reallocing the wt_status.changes array.
This change decreases the time in wt_status_collect_changes_worktree() from 125 seconds to 45 seconds on my very large repository.
Plus, Git 2.17 (Q2 2018) will introduce a new trace for measuring where the time is spent in the index-heavy operations.
See commit ca54d9b (27 Jan 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit 090dbea, 15 Feb 2018)
trace: measure where the time is spent in the index-heavy operations
All the known heavy code blocks are measured (except object database access). This should help identify if an optimization is effective or not.
An unoptimized git status would give something like below:
0.001791141 s: read cache ...
0.004011363 s: preload index
0.000516161 s: refresh index
0.003139257 s: git command: ... 'status' '--porcelain=2'
0.006788129 s: diff-files
0.002090267 s: diff-index
0.001885735 s: initialize name hash
0.032013138 s: read directory
0.051781209 s: git command: './git' 'status'
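If you want to collect similar timings on your own repo, this output comes from Git's performance trace, which (to the best of my knowledge, not stated in the quote above) you enable via an environment variable:
GIT_TRACE_PERFORMANCE=1 git status                  # print timings to stderr
GIT_TRACE_PERFORMANCE=/tmp/git-perf.log git status  # or append them to a file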
The same Git 2.17 (Q2 2018) improves git status with:
commit f39a757, commit 3ca1897, commit fd9b544, commit d7d1b49 (09 Jan 2018) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 4094e47, 08 Mar 2018)
"git status" can spend a lot of cycles to compute the relation between the current branch and its upstream, which can now be disabled with the "--no-ahead-behind" option.
commit ebbed3b (25 Feb 2018) by Derrick Stolee (derrickstolee).
revision.c: reduce object database queries
In mark_parents_uninteresting(), we check for the existence of an object file to see if we should treat a commit as parsed. The result is to set the "parsed" bit on the commit.
Modify the condition to only check has_object_file() if the result would change the parsed bit.
When a local branch is different from its upstream ref, "git status" will compute ahead/behind counts. This uses paint_down_to_common() and hits mark_parents_uninteresting().
On a copy of the Linux repo with a local instance of "master" behind the remote branch "origin/master" by ~60,000 commits, we find the performance of "git status" went from 1.42 seconds to 1.32 seconds, for a relative difference of -7.0%.
Git 2.24 (Q3 2019) proposes another setting to improve git status performance:
See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)
repo-settings: create feature.manyFiles setting
The feature.manyFiles setting is suitable for repos with many files in the working directory. By setting index.version=4 and core.untrackedCache=true, commands such as 'git status' should improve.
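To try it, the umbrella setting can be enabled per repository (a sketch of my own, not part of the quote):
git config feature.manyFiles true   # implies index.version=4 and core.untrackedCache=true
git status                          # subsequent runs should benefit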
But:
With Git 2.24 (Q4 2019), the codepath that reads the index.version configuration was broken with a recent update, which has been corrected.
See commit c11e996 (23 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 4d6fb2b, 24 Oct 2019)
repo-settings: read an int for index.version
Signed-off-by: Derrick Stolee
Several config options were combined into a repo_settings struct in ds/feature-macros, including a move of the "index.version" config setting in 7211b9e ("repo-settings: consolidate some config settings", 2019-08-13, Git v2.24.0-rc1 -- merge listed in batch #0).
Unfortunately, in what is clearly a copy-paste slip among a lot of boilerplate, the config setting is parsed with repo_config_get_bool() instead of repo_config_get_int(). This means that a setting "index.version=4" would not register correctly and would revert to the default version of 3.
I caught this while incorporating v2.24.0-rc0 into the VFS for Git codebase, where we really care that the index is in version 4.
This was not caught by the codebase because the version checks placed in t1600-index.sh did not test the "basic" scenario enough. Here, we modify the test to include these normal settings so they are not overridden by features.manyFiles or GIT_INDEX_VERSION.
While the "default" version is 3, this is demoted to version 2 in do_write_index() when not necessary.
git status will also compare SHA-1s faster with Git 2.33 (Q3 2021), which uses an optimized hashfile API in the codepath that writes the index file.
See commit f6e2cd0, commit 410334e, commit 2ca245f (18 May 2021), and commit 68142e1 (17 May 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 0dd2fd1, 14 Jun 2021)
csum-file.h: increase hashfile buffer size
Signed-off-by: Derrick Stolee
The hashfile API uses a hard-coded buffer size of 8KB and has ever since it was introduced in c38138c ("git-pack-objects: write the pack files with a SHA1 csum", 2005-06-26, Git v0.99 -- merge). It performs a similar function to the hashing buffers in read-cache.c, but that code was updated from 8KB to 128KB in f279894 ("read-cache: make the index write buffer size 128K", 2021-02-18, Git v2.31.0-rc1 -- merge). The justification there was that do_write_index() improves from 1.02s to 0.72s.
Since our end goal is to have the index writing code use the hashfile API, we need to unify this buffer size to avoid a performance regression.
Since these buffers are now on the heap, we can adjust their size based on the needs of the consumer. In particular, callers to hashfd_throughput() are expecting to report progress indicators as the buffer flushes. These callers would prefer the smaller 8k buffer to avoid large delays between updates, especially for users with slower networks. When the progress indicator is not used, the larger buffer is preferable.
By adding a new trace2 region in the chunk-format API, we can see that the writing portion of 'git multi-pack-index write'(man) lowers from ~1.49s to ~1.47s on a Linux machine. These effects may be more pronounced or diminished on other filesystems.
Upvotes: 11
Reputation: 5519
Try git gc. Also, git clean may help.
The git manual states:
Runs a number of housekeeping tasks within the current repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects which may have been created from prior invocations of git add.
Users are encouraged to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance.
I always notice a difference after running git gc when git status is slow!
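A quick way to see whether gc actually did anything is to compare object counts before and after (my own suggestion, not part of the quoted manual):
git count-objects -v   # loose/packed object counts and sizes
git gc
git count-objects -v   # loose objects should now mostly be packed or pruned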
UPDATE II - Not sure how I missed this, but the OP already tried git gc and git clean. I swear that wasn't originally there, but I don't see any changes in the edits. Sorry for that!
Upvotes: 40
Reputation: 4226
OK, this would be quite hard to believe if I hadn't seen it with my own eyes... I had very BAD performance on my brand new work laptop: git status took 5 to 10 seconds to complete even for the most trivial repository.
I tried all the advice in this thread, then noticed that git log was slow as well, so I broadened my search to general slowness of a fresh Git installation and found this:
https://github.com/gitextensions/gitextensions/issues/5314#issuecomment-416081823
In a desperate move I updated my laptop's graphics driver and...
Holy Santa Claus sh*t... that did the trick!
...for me too!
So apparently the graphics card driver plays a role here... hard to understand why, but now the performance is "as expected"!
Upvotes: 2
Reputation: 851
This is a pretty old question, but I am surprised that no one has mentioned binary files, given the repository size.
You mentioned that your git repo is ~10 GB. It seems that apart from the NFS issue and other git issues (resolvable by git gc and configuration changes as outlined in other answers), git commands (git status, git diff, git add) might be slow because of a large number of binary files in the repository. Git is not good at handling binary files. You can remove unnecessary binary files using the following command (the example is for NetCDF files; back up the git repository first):
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch *.nc' \
--prune-empty --tag-name-filter cat -- --all
Do not forget to add '*.nc' to your .gitignore file to stop git from committing those files again.
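For completeness, that last step might look like this (pattern and commit message are just examples):
echo '*.nc' >> .gitignore
git add .gitignore
git commit -m "Ignore NetCDF files"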
Upvotes: -1
Reputation: 25695
index.lock files
git status can be pathologically slow when you have leftover index.lock files.
This happens especially when you have git submodules, because then you often don't notice such leftover files.
Summary: run find .git/ -name index.lock, and delete the leftover files after checking that they are indeed not used by any currently running program.
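A minimal version of that check-and-clean (the lsof check is my own addition; adjust as needed):
find .git/ -name index.lock          # list any leftover lock files
lsof +D .git 2>/dev/null             # confirm no running process still holds files under .git
find .git/ -name index.lock -delete  # then remove the leftovers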
I found that git status was extremely slow in my repo, with git 2.19 on Ubuntu 16.04.
I dug in and found that /usr/bin/time git status in my assets git submodule took 1.7 seconds. With strace I found that git read all my big files in there with mmap. It doesn't usually do that; usually stat is enough.
I googled the problem and found the Use of index and Racy Git problem.
I tried git update-index somefile (in my case the gitignore in the submodule checkout) as shown here, but it failed with:
fatal: Unable to create '/home/niklas/src/myproject/.git/modules/assets/index.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
This is a classic error. Usually you notice it on any git operation, but for submodules that you don't commit to often, you may not notice it for months, because it only appears when adding something to the index; the warning is not raised by a read-only git status.
After removing the index.lock file, git status became fast immediately, the mmaps disappeared, and it's now over 1000x faster.
So if your git status is unnaturally slow, check find .git/ -name index.lock and delete the leftovers.
Upvotes: 1
Reputation: 915
Something that hasn't been mentioned yet: activate the filesystem cache on Windows machines (Linux filesystems are completely different and Git was optimized for them, so this probably only helps on Windows).
git config core.fscache true
git config core.ignoreStat true
BUT: changed files then have to be added explicitly by the developer with git add; Git won't detect the changes by itself.
Upvotes: 5
Reputation: 24991
git config --global core.preloadIndex true
Did the job for me. Check the official documentation here.
Upvotes: 7
Reputation: 401
I'm also seeing this problem on a large project shared over NFS.
It took me some time to discover the -uno flag, which can be given to both git commit and git status.
What this flag does is disable looking for untracked files. This reduces the number of NFS operations significantly. The reason is that in order to discover untracked files, git has to look in all subdirectories, so if you have many subdirectories this will hurt you. By stopping git from looking for untracked files you eliminate all these NFS operations.
Combine this with the core.preloadIndex flag and you can get reasonable performance even on NFS.
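Putting the two together, a short sketch (the per-invocation flag and the repo-wide default do the same thing; pick one):
git config core.preloadIndex true         # parallelize the index lstat() calls
git status -uno                           # skip the untracked-file search for this run
git config status.showUntrackedFiles no   # or make skipping untracked files the default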
Upvotes: 40
Reputation: 2190
If your git repo makes heavy use of submodules, you can greatly speed up the performance of git status by editing the config file in the .git directory and setting ignore = dirty on any particularly large/heavy submodules. For example:
[submodule "mysubmodule"]
url = ssh://mysubmoduleURL
ignore = dirty
You'll lose the convenience of a reminder that there are unstaged changes in any of the submodules that you may have forgotten about, but you'll still retain the main convenience of knowing when the submodules are out of sync with the main repo. Plus, you can still change your working directory to the submodule itself and use git status within it as per usual to see more information. See this question for more details about what "dirty" means.
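Equivalently, you can set the same option from the command line instead of editing .git/config by hand (the submodule name matches the example above):
git config submodule.mysubmodule.ignore dirty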
Upvotes: 27