Mokus

Reputation: 10400

How does Git handle file deletion and re-adding?

I would like to know how Git handles file manipulation. Say I delete FileA and, two commits later, re-add the same file at the same path. Will FileA be stored as a new copy in the Git history, or will the copy that existed two commits earlier be linked to the current commit? What happens if FileA is slightly changed?

Upvotes: 2

Views: 82

Answers (1)

Schwern

Reputation: 164759

tl;dr Git stores the contents of files separately from their filenames. If the content is the same, it reuses the existing content. If it is even slightly modified, it stores a new copy. Periodically, Git repacks objects into packfiles, which can store just the content changes.

When Git stores a file, it uses two objects:

  1. The tree
  2. The blob (Binary Large OBject)

The tree is basically a directory listing. For each entry it contains the name of the file or directory, its permissions, the type of object (blob or tree), and the ID of that object.

100644 blob a906cb2a4a904a152e80877d4088654daad0c859      somefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      somedir

Then it stores the content of the file, compressed, in the blob. The above says the content of somefile is stored in the blob a906cb2a4a904a152e80877d4088654daad0c859.
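To look at these two object types in a real repository, something like the following should work. It is a minimal sketch in a throwaway repository; the file names and the Example identity passed with -c are placeholders chosen for illustration, not anything from the original setup.

```shell
# Build a scratch repository with one file and one subdirectory.
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'hello\n' > somefile
mkdir somedir
printf 'nested\n' > somedir/inner
git add .
# Placeholder identity so the commit works without any global Git config.
git -c user.name=Example -c user.email=example@example.com commit -q -m First

# Print the root tree of the latest commit: mode, type, object ID, name.
git cat-file -p 'HEAD^{tree}'

# Look up the blob ID recorded for somefile, then inspect that object.
blob=$(git rev-parse HEAD:somefile)
git cat-file -t "$blob"   # object type
git cat-file -p "$blob"   # stored file content
```

The tree listing only names the blob by ID; the content itself is fetched separately with git cat-file -p.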

If you have two files with the same content, Git will use the same blob for both files.
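The reason identical files share a blob is that the blob ID is a hash of the content alone. The sketch below recomputes that ID by hand. It assumes the default SHA-1 object format and a sha1sum binary (GNU coreutils); blob_id is a hypothetical helper name, not a Git command.

```shell
# Compute the ID Git would assign to a file's content.
blob_id() {
  # Git hashes the header "blob <size-in-bytes>\0" followed by the raw content.
  size=$(wc -c < "$1" | tr -d ' ')
  { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

workdir=$(mktemp -d)
printf 'same content\n' > "$workdir/a.txt"
printf 'same content\n' > "$workdir/b.txt"

blob_id "$workdir/a.txt"
blob_id "$workdir/b.txt"   # same ID: the filename is not part of the hash
```

Because the filename never enters the hash, two paths with the same bytes can only ever map to one blob.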

If you git rm somefile and commit, Git will create a new tree without the file and attach that to the commit. Since the blob is still referenced by earlier trees in earlier commits, it will stick around.

040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      somedir

If you git add newfile with the same content as the old one, Git will reuse the same blob.

100644 blob a906cb2a4a904a152e80877d4088654daad0c859      newfile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      somedir

What happens if FileA is slightly changed?

Git will store a new blob object with the complete content of the new file.

100644 blob 8f94139338f9404f26296befa88755fc2598c289      somefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      somedir
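A quick way to see that even a one-character change yields a different blob is to hash both versions without writing anything. This is a small sketch; the extra exclamation mark is just an arbitrary edit chosen for illustration.

```shell
# git hash-object without -w only computes the ID; nothing is stored.
original=$(printf 'Basset hounds got long ears\n' | git hash-object --stdin)
changed=$(printf 'Basset hounds got long ears!\n' | git hash-object --stdin)

echo "$original"
echo "$changed"   # a completely different ID, so Git stores a second full blob
```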

Git will eventually optimize this by packing the individual objects into packfiles, which can store just the deltas between similar objects.

See Git Objects in Pro Git for more.


Here's a quick demonstration.

$ echo 'Basset hounds got long ears' > FileA
$ git add FileA
$ git commit -m First
[main (root-commit) af9df46] First
 1 file changed, 1 insertion(+)
 create mode 100644 FileA
$ git hash-object FileA
34f45be4cebdae4cf67218bd47df88dcd9a4cdc6
$ tree .git/objects/
.git/objects/
├── 34
│   └── f45be4cebdae4cf67218bd47df88dcd9a4cdc6
├── af
│   └── 9df4604a35039b68625b8283d7b36fb0409136
├── e5
│   └── d8ddccedc871c546b4f6bf0e316165786c62ba
├── info
└── pack

5 directories, 3 files

af9df46 is the commit object. e5d8ddcce is the tree object. 34f45be4ce is the blob object containing the content of FileA.

$ git rm FileA
rm 'FileA'
$ git commit -m Second
[main 3bcbfae] Second
 1 file changed, 1 deletion(-)
 delete mode 100644 FileA
$ tree .git/objects/
.git/objects/
├── 34
│   └── f45be4cebdae4cf67218bd47df88dcd9a4cdc6
├── 3b
│   └── cbfae6e607ef605b572f2b88ea21ad021b030b
├── 4b
│   └── 825dc642cb6eb9a060e54bf8d69288fbee4904
├── af
│   └── 9df4604a35039b68625b8283d7b36fb0409136
├── e5
│   └── d8ddccedc871c546b4f6bf0e316165786c62ba
├── info
└── pack

7 directories, 5 files

3bcbfae6 is the second commit object. 4b825dc is the new tree object. Note that the 34f45be4ce blob is still there.
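The 4b825dc tree is the ID of an empty tree, which fits here because the second commit contains no files at all. Assuming the default SHA-1 object format, this can be checked by hashing empty input as a tree:

```shell
# Hash zero bytes as a tree object; nothing is written to any repository.
empty_tree=$(git hash-object -t tree /dev/null)
echo "$empty_tree"
```

Any repository using SHA-1 should report the same ID for an empty tree, since the ID depends only on the (empty) content.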

$ echo 'Basset hounds got long ears' > FileB
$ git add FileB
$ git hash-object FileB
34f45be4cebdae4cf67218bd47df88dcd9a4cdc6
$ git commit -m Third
[main 9ba46ad] Third
 1 file changed, 1 insertion(+)
 create mode 100644 FileB
$ tree .git/objects/
.git/objects/
├── 34
│   └── f45be4cebdae4cf67218bd47df88dcd9a4cdc6
├── 3b
│   └── cbfae6e607ef605b572f2b88ea21ad021b030b
├── 4b
│   └── 825dc642cb6eb9a060e54bf8d69288fbee4904
├── 9b
│   └── a46ad12eab0a384ebae59aa46def2bbc2b7f0a
├── af
│   └── 9df4604a35039b68625b8283d7b36fb0409136
├── c9
│   └── 8f44c0bd58f45a14f0bb29b15acd4c1616b0dc
├── e5
│   └── d8ddccedc871c546b4f6bf0e316165786c62ba
├── info
└── pack

9 directories, 7 files

We added a different file with the same content as FileA had. There's a new commit object, 9ba46ad, and a new tree object, c98f44c. But the new tree uses the same blob, 34f45be4.
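Instead of reading the object directory, you can also ask Git directly which blob each commit's path points at. The sketch below replays the three commits in a scratch repository; commit_as_example is a hypothetical helper that only supplies a placeholder identity.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical helper: commit with a placeholder identity.
commit_as_example() {
  git -c user.name=Example -c user.email=example@example.com commit -q -m "$1"
}

echo 'Basset hounds got long ears' > FileA
git add FileA
commit_as_example First

git rm -q FileA
commit_as_example Second

echo 'Basset hounds got long ears' > FileB
git add FileB
commit_as_example Third

# Resolve <commit>:<path> to the blob ID recorded in that commit's tree.
old_blob=$(git rev-parse 'HEAD~2:FileA')
new_blob=$(git rev-parse 'HEAD:FileB')
echo "$old_blob"
echo "$new_blob"
```

Both lines should print the same ID, even though the file was deleted in between and came back under a different name.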

$ git gc
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (10/10), done.
Total 10 (delta 0), reused 7 (delta 0), pack-reused 0
$ tree .git/objects/
.git/objects/
├── info
│   ├── commit-graph
│   └── packs
└── pack
    ├── pack-4e76192447fc323d1026ae980fdbda304b70a597.idx
    └── pack-4e76192447fc323d1026ae980fdbda304b70a597.pack

2 directories, 4 files

After running git gc (Garbage Collection), Git has replaced the individual object files with more efficient packfiles.
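Packing does not change how objects are addressed. As a rough check, the following sketch packs a one-commit scratch repository and then reads the blob back by its ID; the Example identity is again a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'Basset hounds got long ears' > FileA
git add FileA
git -c user.name=Example -c user.email=example@example.com commit -q -m First
blob=$(git rev-parse HEAD:FileA)

git gc --quiet

# "count" is the number of loose objects, "in-pack" the number held in packfiles.
git count-objects -v

# The blob is no longer a loose file, but Git reads it from the packfile by ID.
content=$(git cat-file -p "$blob")
echo "$content"
```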

Upvotes: 5
