nguyendhn

Reputation: 563

Revert a dvc remove -p command

I removed a DVC tracking file by mistake with the command dvc remove training_data.dvc -p, and my entire training dataset is now gone. I know that in Git we can easily restore a deleted branch from its hash. Does anyone know how to recover my lost data in DVC?

Upvotes: 3

Views: 1376

Answers (1)

Shcheklein

Reputation: 6349

Most likely you are safe (at least the data is not gone). From the dvc remove docs:

Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.

So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is cached, similar to Git, in .dvc/cache.

There are ways to retrieve it, but I would need a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help in the comments.)

Recovering a directory

First of all, here is the document that describes briefly how DVC stores directories in the cache.

What we can do is to find all .dir files in the .dvc/cache:

find .dvc/cache -type f -name "*.dir"

outputs something like:

.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir

(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands (e.g. to find files with the .dir extension on S3) look different.)

Each .dir file is a JSON file that holds the content of one version of a directory (file names, hashes, etc.). It has all the information needed to restore that directory. The next step is to work out which one we need. There is no single rule for that; here is what I would recommend checking (pick depending on your use case):

  • Check the date modified (if you remember when this data was added).
  • Check the content of those files - if you remember a specific file name that was present only in the directory you are looking for - just grep it.
  • Try to restore them one by one and check the directory content.
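The first two checks can be combined in a small script. This is only a sketch, assuming each .dir file is a JSON list of objects with a "relpath" key, as in the example in this answer; the function name find_dir_manifests and the file name train.tsv are illustrative, not part of DVC.

```python
import json
import time
from pathlib import Path


def find_dir_manifests(cache_root, wanted_name=None):
    """Return (path, mtime, relpaths) for each .dir file under cache_root, newest first.

    If wanted_name is given, keep only the .dir files that list that relpath.
    Assumes every .dir file is a JSON list of {"md5": ..., "relpath": ...} objects.
    """
    root = Path(cache_root)
    if not root.is_dir():
        return []
    results = []
    for manifest in root.rglob("*.dir"):
        if not manifest.is_file():
            continue
        entries = json.loads(manifest.read_text())
        relpaths = [entry["relpath"] for entry in entries]
        if wanted_name is not None and wanted_name not in relpaths:
            continue
        results.append((manifest, manifest.stat().st_mtime, relpaths))
    # Newest first, which helps if you remember roughly when the data was added.
    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    # "train.tsv" is a placeholder for a file name you remember from the lost directory.
    for path, mtime, relpaths in find_dir_manifests(".dvc/cache", wanted_name="train.tsv"):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, path, relpaths[:5])
```

Calling find_dir_manifests(".dvc/cache") without wanted_name lists every candidate with its modification time, which covers the first check on its own.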

Okay, now let's imagine we decided to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir (e.g. because its content looks like:

[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
{"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]

and we want to get a directory with train.tsv).

The only thing we need to do is to create a .dvc file that references this directory:

outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory

(Note that the path /20/b786b6e6f80e2b3fcf17827ad18597.dir became the hash value 20b786b6e6f80e2b3fcf17827ad18597.dir.)
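The path-to-hash conversion and the .dvc file content can also be generated rather than typed by hand. This is a minimal sketch, assuming the cache layout shown in this answer (a two-character directory followed by the rest of the hash); the helper names hash_from_cache_path and dvc_file_text are made up for illustration.

```python
from pathlib import Path


def hash_from_cache_path(cache_path):
    """Join the two-character directory name and the file name into one hash value."""
    path = Path(cache_path)
    return path.parent.name + path.name


def dvc_file_text(cache_path, out_path):
    """Build the text of a minimal .dvc file that references the given .dir entry."""
    return "outs:\n- md5: {}\n  path: {}\n".format(
        hash_from_cache_path(cache_path), out_path
    )


if __name__ == "__main__":
    # Prints the same .dvc content as shown above; save it to a file of your choice.
    print(dvc_file_text(".dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir", "my-directory"))
```

The output can be redirected into a .dvc file, for example one named after the directory you want to restore.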

And run dvc pull on this file.

That should be it.

Upvotes: 3
