rastasheep
rastasheep

Reputation: 11222

Git clean exclude nested sub directory

I'am having trouble with git clean and exclude options for nested dirs.

I would like to clean all uncommitted files from repo excluding vendor/bundle dir. My test repo loks like:

debugg-dir/
  .git/
  file.txt
  not-commited-file
  not-commited-folder
      another-not-commited-file
  vendor/
    bundle/
      another-not-commited-file

Reproduce test repo with:

git init debugg-dir && cd debugg-dir
touch file.txt && g add . && git commit -m "Commit" 
mkdir -p not-commited-folder && touch not-commited-folder/another-not-commited-file
mkdir -p vendor/bundle && touch vendor/bundle/another-not-commited-file && touch not-commited-file

Git clean command:

git clean -d -x -n -e vendor/bundle

After clean expected to have:

debugg-dir/
  .git/
  file.txt
  vendor/
    bundle/
      another-not-commited-file

Is there any proper way to exclude nested dir from git clean command?

# EDIT:

Explanation:

There is no "clean" solution for this situation. Git clean excludes dirs with git clean -d -x -n -e dir_name but this doesn't work with nested dirs. Is this bug in git or there is some good reason for that? More info why this doesnt work you can find in source. Long story short, exclude pattern only works for strings till first '/' in pattern.

My solution:

cd vendor && git clean -dxf -e bundle && cd ..
git clean -dxf -e vendor

With this i managed to only keep nested dir and it's contents.

Upvotes: 6

Views: 10063

Answers (4)

Alexandre Prevost
Alexandre Prevost

Reputation: 39

This is the code I use to clean up my git repos while excluding the venv/ directory and sub-directory:

git clean -nXd -e \!venv -e \!venv/**

For your case, the first exclusion will be enough:

git clean -nXd -e \!vendor

The second exclusion \!venv/** are for other rules in .gitignore that may apply to file or folder inside vendor. Ex:

.gitignore    
*.log

vendor/
  bundle/
    another-not-commited-file.log

Upvotes: 1

VonC
VonC

Reputation: 1324557

Git 2.24 (Q4 2019) makes git clean more robust when it comes to nested Git repositories (not just folders)

See commit 69f272b (01 Oct 2019), and commit 902b90c, commit ca8b539, commit 09487f2, commit e86bbcf, commit 3aca580, commit 29b577b, commit 89a1f4a, commit a3d89d8, commit 404ebce, commit a5e916c, commit bbbb6b0, commit 7541cc5 (17 Sep 2019) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit aafb754, 11 Oct 2019)

clean: avoid removing untracked files in a nested Git repository

Users expect files in a nested git repository to be left alone unless sufficiently forced (with two -f's).

Unfortunately, in certain circumstances, git would delete both tracked (and possibly dirty) files and untracked files within a nested repository.

To explain how this happens, let's contrast a couple cases.

First, take the following example setup (which assumes we are already within a git repo):

git init nested
cd nested
>tracked
git add tracked
git commit -m init
>untracked
cd ..

In this setup, everything works as expected; running 'git clean -fd' will result in fill_directory() returning the following paths:

nested/
nested/tracked
nested/untracked

and then correct_untracked_entries() would notice this can be compressed to:

nested/

and then since "nested/" is a directory, we would call remove_dirs("nested/", ...), which would check is_nonbare_repository_dir() and then decide to skip it.

However, if someone also creates an ignored file:

>nested/ignored

then running 'git clean -fd' would result in fill_directory() returning the same paths:

nested/
nested/tracked
nested/untracked

but correct_untracked_entries() will notice that we had ignored entries under nested/ and thus simplify this list to

nested/tracked
nested/untracked

Since these are not directories, we do not call remove_dirs() which was the only place that had the is_nonbare_repository_dir() safety check -- resulting in us deleting both the untracked file and the tracked (and possibly dirty) file.

One possible fix for this issue would be walking the parent directories of each path and checking if they represent nonbare repositories, but that would be wasteful.
Even if we added caching of some sort, it's still a waste because we should have been able to check that "nested/" represented a nonbare repository before even descending into it in the first place.
Add a DIR_SKIP_NESTED_GIT flag to dir_struct.flags and use it to prevent fill_directory() and friends from descending into nested git repos.

With this change, we also modify two regression tests added in commit 91479b9 ("t7300: add tests to document behavior of clean and nested git", 2015-06-15, Git v2.6.0-rc0).
That commit, nor its series, nor the six previous iterations of that series on the mailing list discussed why those tests coded the expectation they did.
In fact, it appears their purpose was simply to test existing behavior to make sure that the performance changes didn't change the behavior.
However, these two tests directly contradicted the manpage's claims that two -f's were required to delete files/directories under a nested git repository.
While one could argue that the user gave an explicit path which matched files/directories that were within a nested repository, there's a slippery slope that becomes very difficult for users to understand once you go down that route (e.g. what if they specified "git clean -f -d '*.c'"?)
It would also be hard to explain what the exact behavior was; avoid such problems by making it really simple.

Finally, there are still a couple bugs with -ffd not cleaning out enough (e.g. missing the nested .git) and with -ffdX possibly cleaning out the wrong files (paying attention to outer .gitignore instead of inner).
This patch does not address these cases at all (and does not change the behavior relative to those flags), it only fixes the handling when given a single -f.
See this thread for more discussion of the -ffd[X?] bugs.


With Git 2.25.1 (Feb. 2020), a corner case bugs in "git clean" that stems from a (necessarily for performance reasons) awkward calling convention in the directory enumeration API has been corrected.

See commit 0cbb605, commit ad6f215 (16 Jan 2020) by Jeff King (peff).
See commit 2270533 (16 Jan 2020) by Elijah Newren (newren).
See commit f365bf4 (16 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 7ab963e, 05 Feb 2020)

dir: treat_leading_path() and read_directory_recursive(), round 2

Signed-off-by: Elijah Newren

I was going to title this "dir: more synchronizing of treat_leading_path() and read_directory_recursive()", a nod to commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19, Git v2.25.0-rc0 -- merge), but the title was too long.

Anyway, first the backstory...

fill_directory() has always had a slightly error-prone interface: it returns a subset of paths which might match the specified pathspec; it was intended to prune away some paths which didn't match the specified pathspec and keep at least all the ones that did match it.

Given this interface, callers were responsible to post-process the results and check whether each actually matched the pathspec.

builtin/clean.c did this.

It would first prune out duplicates (e.g. if "dir" was returned as well as all files under "dir/", then it would simplify this to just "dir"), and after pruning duplicates it would compare the remaining paths to the specified pathspec(s).

This post-processing itself could run into problems, though, as noted in commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17, Git v2.24.0-rc0 -- merge listed in batch #8):

For the case of git clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both: "dir" and "dir/file"
Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with "dir".
Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed.
(Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.)

That commit fixed the issue -- when multiple pathspecs were specified -- by making sure fill_directory() wouldn't return both "dir" and "dir/file" outside the common_prefix_len optimization path.

This is where it starts to get fun.

In commit b9670c1f5e6b ("dir: fix checks on common prefix directory", 2019-12-19, Git v2.25.0-rc0 -- merge), we noticed that the common_prefix_len wasn't doing appropriate checks and letting all kinds of stuff through, resulting in recursing into .git/ directories and other craziness.

So it started locking down and doing checks on pathnames within that code path.

That continued with commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19, Git v2.25.0-rc0 -- merge), which noted the following:

Our optimization to avoid calling into read_directory_recursive() when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive() would use if we had just called it from the root.
Since it does more than call treat_path(), we need to copy that same logic.

...and then it more forcefully addressed the issue with this wonderfully ironic statement:

Needing to duplicate logic like this means it is guaranteed someonewill eventually need to make further changes and forget to update both locations.
It is tempting to just nuke the leading_directory special casing to avoid such bugs and simplify the code, but unpack_trees' verify_clean_subdirectory() also calls read_directory() and does so with a non-empty leading path, so I'm hesitant to try to restructure further.
Add obnoxious warnings to treat_leading_path() and read_directory_recursive() to try to warn people of such problems.

You would think that with such a strongly worded description, that its author would have actually ensured that the logic in treat_leading_path() and read_directory_recursive() did actually match and that everything that was needed had at least been copied over at the time that this paragraph was written.

But you'd be wrong, I messed it up by missing part of the logic.

Upvotes: -1

ThanksForAllTheFish
ThanksForAllTheFish

Reputation: 7251

As per git clean --help

git-clean - Remove untracked files from the working tree

If you add to this Floyd Pink's explanation about -d (shortly, the option allows to remove also untracked directories and not just files), then that is why you get vendor removed as well.
Now, supposedly you want to remove only not-commited-file (so, neither any untracked directory nor another-not-commited-file) I think you should git clean interactive mode, so

git clean -i

which will ask you what to do for each untracked file (only files, add -d if you want to asked for directory too).

EDIT after OP's question editing: you want to remove directories too, so run

git clean -i -d

EDIT 2: since the meaning of -e was not clear to me from the manual, I googled it and found this. I suggest reading the conversation since it explains the real meaning of -e which is not how the OP intended it (or can be understood from the manual)

EDIT 3, more about -e switch. Following the link I found in edit 2, I decided to try it out. Here the result, which I hope will help you to understand -e.
Content of .gitignore, so I don't commit temporary files:

*.tmp

I issued the commands:

echo "Temporary file" > sample.tmp
git st //which of course shows *nothing to commit, working directory clean*
git clean -fX -e \!sample.tmp

The result is that ALL files with tmp extension are deleted (due to -X) BUT sample.tmp. So, in conclusion, what -e really does, in my understanding and please correct me if in the I am wrong, is NOT to exclude patterns from the cleaning process but

to exclude patterns from the rule of cleaning (in my case the rule was to remove ALL .tmp files, from which I manually excluded sample.tmp).

Upvotes: 0

Hari Pachuveetil
Hari Pachuveetil

Reputation: 10374

This is because vendor is an untracked directory and you are using the option - d.

As the manual says:

-d

Remove untracked directories in addition to untracked files. If an untracked directory is managed by a different git repository, it is not removed by default. Use -f option twice if you really want to remove such a directory.

I could get the required output using this command:

git clean -x -n

Does that work in the real scenario? If it doesn't, you might want to commit some other file within vendor/bundle and then see.

Upvotes: 2

Related Questions