Reputation: 563
I have just removed a DVC tracking file by mistake using the command dvc remove training_data.dvc -p, which wiped out my entire training dataset. I know that in Git we can easily restore a deleted branch from its hash. Does anyone know how to recover all my lost data in DVC?
Upvotes: 3
Views: 1376
Reputation: 6349
Most likely you are safe (at least the data is not gone). From the dvc remove docs:
Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.
So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is still cached in .dvc/cache, similar to how Git keeps its objects.
There are ways to retrieve it, but I would need a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help in the comments.)
First of all, here is the document that briefly describes how DVC stores directories in the cache.
What we can do is find all the .dir files in .dvc/cache:
find .dvc/cache -type f -name "*.dir"
outputs something like:
.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir
(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands, e.g. to find files with the .dir extension on S3, look different.)
Each .dir file is a JSON file describing the content of one version of a directory (file names, hashes, etc.). It has all the information needed to restore that version. The next thing we need to do is work out which one we need. There is no single rule for that; here is what I would recommend checking (pick depending on your use case):
Okay, now let's imagine we decided to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir, e.g. because its content looks like this:
[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
{"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]
and we want to get a directory with train.tsv.
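To compare all the candidates at once rather than opening them one by one, something like the following could help. It is only a sketch: it assumes each .dir file is a JSON list of objects with a "relpath" key, as in the example above, and the function name list_dir_entries is made up for illustration.

```python
import json
from pathlib import Path


def list_dir_entries(cache_root=".dvc/cache"):
    """Map each .dir file under cache_root to the sorted relpaths it lists."""
    result = {}
    for dir_file in sorted(Path(cache_root).rglob("*.dir")):
        with open(dir_file) as fh:
            entries = json.load(fh)  # assumed: a list of {"md5": ..., "relpath": ...}
        result[str(dir_file)] = sorted(entry["relpath"] for entry in entries)
    return result


if __name__ == "__main__":
    # Print every candidate directory version with the files it contains.
    for path, relpaths in list_dir_entries().items():
        print(path)
        for relpath in relpaths:
            print("   ", relpath)
```

Run from the repository root, this prints every .dir file followed by the files it references, which should make it easier to spot the version that contains train.tsv.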
The only thing we need to do is create a .dvc file that references this directory:
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory
(Note that the path /20/b786b6e6f80e2b3fcf17827ad18597.dir became the hash value 20b786b6e6f80e2b3fcf17827ad18597.dir: the two-character directory name is prepended to the file name.)
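If you would rather not do that conversion by hand, a tiny helper along these lines should work. It assumes the cache layout shown above, where the last directory holds the first two characters of the hash; cache_path_to_hash is a hypothetical name, not part of DVC.

```python
from pathlib import PurePosixPath


def cache_path_to_hash(cache_path):
    """Join the two-character directory name and the file name into one hash."""
    path = PurePosixPath(cache_path)
    return path.parent.name + path.name


print(cache_path_to_hash(".dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir"))
# prints 20b786b6e6f80e2b3fcf17827ad18597.dir
```

Only the last two path components are used, so the prefix in front of them does not matter.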
Then run dvc pull on this file.
That should be it.
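For completeness, here is a rough sketch of generating the text of that .dvc file from a chosen hash. It only mirrors the minimal two-field layout shown above; render_dvc_file is a hypothetical helper, and your DVC version may expect additional fields.

```python
def render_dvc_file(dir_hash, out_path):
    """Return the text of a minimal .dvc file with one output pointing at dir_hash."""
    return f"outs:\n- md5: {dir_hash}\n  path: {out_path}\n"


# The hash and output path are the ones used in the example above.
print(render_dvc_file("20b786b6e6f80e2b3fcf17827ad18597.dir", "my-directory"), end="")
```

Save the printed text under a name of your choice, e.g. my-directory.dvc (an assumed name), and then run dvc pull my-directory.dvc.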
Upvotes: 3