Jau A

Reputation: 497

Delete cached data from DVC

I would like to be able to delete individual files or folders from the DVC cache after they have been pulled with dvc pull, so that they no longer occupy space on the local disk.

Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:

dvc pull <my_data_folder.dvc>

This places the downloaded data in .dvc/cache and creates a set of soft links in my_data_folder (if you have configured DVC to use soft links). Running:

ls -l my_data_folder

You will see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...

Imagine you don't need this data for a while and want to free the local disk space it uses. I know of two manual approaches for doing that, although I am not sure about the second one:

Preliminary step (optional)

This is not needed if DVC is using symlinks (which I believe is the case, at least on Unix-like systems):

dvc unprotect my_data_folder

Approach 1 (verified):

Delete all the cached data. From the repo's root folder:

rm -r my_data_folder
rm -rf .dvc/cache

This seems to work properly and completely frees the disk space previously used by the downloaded data. Once we need the data again, we can fetch it with dvc pull as before. The drawback is that this removes all the data downloaded with DVC so far, not only the data corresponding to my_data_folder, so we would need to run dvc pull for everything again.

Approach 2 (NOT verified):

Delete only specific files (this still needs thorough testing to confirm that it does not corrupt DVC in any way):

First, take note of the path indicated in the soft link:

ls -l my_data_folder

You will see something like:

my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792

If you want to delete my_data_file_1.pk, from the repo's root folder run:

rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
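For a whole folder, the same manual step could be repeated for every soft link. The following is only a sketch under the same caveats as Approach 2 (not verified against DVC internals). The function name drop_cached_folder is hypothetical, not a DVC command, and it assumes the files in the folder are soft links into the cache, as in the listing above. It only handles links directly inside the folder and does not touch any other cache entries DVC may keep for the directory itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): remove the cache objects that the
# soft links directly inside a folder point to, then remove the links.
# Assumes a symlinked checkout, as in the `ls -l` listing above.
drop_cached_folder() {
    folder=$1
    for link in "$folder"/*; do
        # Skip anything that is not a soft link (regular files, subfolders).
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        # A relative target is resolved against the folder holding the link.
        case $target in
            /*) ;;
            *) target="$folder/$target" ;;
        esac
        # -f: the cache object may already be gone if two links share it.
        rm -f "$target"
        rm -f "$link"
    done
}
```

Usage would be drop_cached_folder my_data_folder from the repo's root folder, followed later by dvc pull <my_data_folder.dvc> to get the data back. Note that if another folder links to the same cache object, that data is removed too.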

Note on dvc gc

For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.

I would appreciate it if someone could suggest a better way, or comment on whether the second approach is actually appropriate. Also, if I want to delete the whole folder rather than go file by file, is there any way to do that automatically?

Thank you!

Upvotes: 3

Views: 6094

Answers (1)

Shcheklein

Reputation: 6349

It's not possible at the moment to granularly specify a directory / file to be removed from the cache. Here are the tickets to vote and ask to prioritize this:

"For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case."

This is a bit concerning. If you run it with the -w option, it keeps only the files / dirs that are referenced in the current versions of the .dvc and dvc.lock files, and it should remove everything else.

So, let's say you are building a model:

my_model_file.pk

You created it once, and its hash is 4f7bc7702897bec7e0fae679e968d792. That hash is written in dvc.lock or in my_model_file.dvc.

Then you do another iteration, and now the hash is different: 5a8cc7702897bec7e0faf679e968d363. The new hash should now be written in the .dvc or lock file, which means the model that corresponds to the previous hash, 4f7bc7702897bec7e0fae679e968d792, is no longer referenced. In this case dvc gc -w should definitely collect it. If that is not happening, please create a ticket and we'll try to reproduce it and take a look.
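To check whether the old object was really collected, something like the sketch below might help. It assumes the cache layout shown in the question (.dvc/cache/<first two characters of the hash>/<rest of the hash>); other DVC versions may lay the cache out differently, so check your own .dvc/cache first. The function name cache_path_for_hash is hypothetical, not a DVC command.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): map a hash to the cache layout
# seen in the question, i.e. <cache dir>/<first 2 chars>/<remaining chars>.
# The cache directory defaults to .dvc/cache and is an assumption.
cache_path_for_hash() {
    hash=$1
    cache_dir=${2:-.dvc/cache}
    prefix=$(printf '%s' "$hash" | cut -c1-2)
    rest=$(printf '%s' "$hash" | cut -c3-)
    printf '%s/%s/%s\n' "$cache_dir" "$prefix" "$rest"
}

# Usage sketch, run from the repo's root folder:
#   old=$(cache_path_for_hash 4f7bc7702897bec7e0fae679e968d792)
#   dvc gc -w
#   if [ -e "$old" ]; then echo "still cached: $old"; else echo "collected: $old"; fi
```

If the old path is still there after dvc gc -w even though no current .dvc or dvc.lock file references that hash, that would be worth including in the ticket.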

Upvotes: 1
