alper

Reputation: 3410

Efficiency of IPFS for sharing updated file

Update 10-30-2019:

=> Please see the following discussion for a feature request to IPFS: git-diff feature: Improve efficiency of IPFS for sharing updated file. Decrease file/block duplication

=> Please see the following discussion for additional information: Does IPFS provide block-level file copying feature?


For example, userA adds a 1 GB file with ipfs add file.txt, and userB fetches that file into his storage through IPFS. Later userA realizes a mistake, changes only a single character in the file, and wants to share this updated version with userB.

So userA adds the same file with the small change into IPFS again via ipfs add file, and userB has to fetch the whole 1 GB again instead of updating only that single character. Is there a better approach to this, where only the changed part is pulled by userB, similar to how git transfers only the changes when we do git pull?

Git has a much better approach; please see (https://stackoverflow.com/a/8198276/2402577). Does IPFS use delta compression for storage (https://gist.github.com/matthewmccullough/2695758) like Git, or a similar approach?

Further investigation:

I did a small experiment. First I added a 1 GB file into IPFS. Later, I updated a small line in the file that was already shared via IPFS. I observe that userA pushes the complete 1 GB file all over again, instead of pushing only the block that contains the changed data. That is very expensive and time consuming in my opinion. I shared the hash of the new updated file, and again the complete file is downloaded via IPFS on userB's side, instead of only the block that contains the changed character.

userA

$ fallocate -l 1G gentoo_root.img
$ ipfs add gentoo_root.img
 920.75 MB / 1024.00 MB [========================================>----]  89.92%
added QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm gentoo_root.img

userB

$ ipfs get QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
Saving file(s) to QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm
 1.00 GB / 1.00 GB [==================================] 100.00% 49s

userA

$ echo 'hello' >> gentoo_root.img
$ ipfs add gentoo_root.img   # HERE the node adds the 1 GB file into IPFS all over again. It took 1 hour for me, instead of only adding the changed block.
32.75 MB / 1.00 GB [=>---------------------------------------]   3.20% 1h3m34s
added Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3 gentoo_root.img

userB

# HERE the complete 1 GB file is downloaded all over again.
ipfs get Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
[sudo] password for alper:
Saving file(s) to Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 1.00 GB / 1.00 GB [=========================] 100.00% 45s

[Q] At this point, what is the best solution via IPFS to share the updated file without re-sharing the whole file, so that only the updated blocks are transferred?


In addition, on the same node, whenever I do ipfs cat <hash> it keeps downloading the same content all over again:

$ ipfs cat Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 212.46 MB / 1.00 GB [===========>---------------------------------------------]  20.75% 1m48s

$ ipfs cat Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3
 212.46 MB / 1.00 GB [===========>---------------------------------------------]  20.75% 1m48s
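One thing to check here (a small sketch; whether it applies depends on how the node is configured) is whether the blocks are actually being kept in the local repo between reads:

$ ipfs pin add Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3   # pin so the blocks stay in the local repo and are not garbage-collected
$ ipfs refs local | grep Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3   # check whether the root block is already present locally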

Analysis:

Adding the original file and the updated file causes roughly the same increase in the repo size:

First I created a 100 MB file (file.txt):

NumObjects: 5303
RepoSize:   181351841
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6

   $ ipfs add file.txt
   added QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16 file.txt
   $ ipfs pin add QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16

Here the number of objects increased by 4, and the repo size increased by 37,983 bytes:

$ ipfs repo stat
NumObjects: 5307
RepoSize:   181389824
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6

Then I did echo 'a' >> file.txt followed by ipfs add file.txt.

Here I observe that the number of objects increased by 4 more, so it added the complete file again; the repo size increased by 38,823 bytes:

$ ipfs repo stat
NumObjects: 5311
RepoSize:   181428647
StorageMax: 10000000000
RepoPath:   /home/alper/.ipfs
Version:    fs-repo@6
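One way to check how much is actually shared between the two versions (a small sketch; the second hash below is a placeholder for whatever ipfs add printed for the updated file.txt):

$ ipfs refs -r QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16 | sort > v1_blocks.txt   # all blocks of the original version
$ ipfs refs -r <hash-of-updated-file.txt> | sort > v2_blocks.txt                       # all blocks of the updated version
$ comm -12 v1_blocks.txt v2_blocks.txt | wc -l   # blocks shared by both versions (stored only once)
$ comm -3  v1_blocks.txt v2_blocks.txt | wc -l   # blocks unique to one version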

Upvotes: 3

Views: 1153

Answers (3)

Donovan Baarda

Reputation: 481

IPFS supports rabin chunking, which is a magic way to break a large file up into blocks where the block boundaries happen at the same places in any identical sequences of data, regardless of the alignment of that data. This means the block sizes are variable, and adding a single byte at the start of a large file will typically result in the first block being one byte larger, and all the other blocks being identical.

So rabin chunking will result in IPFS efficiently reusing blocks in large files with only small changes.
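A minimal sketch of how that looks on the command line (the min/avg/max chunk sizes here are only illustrative, not values the answer prescribes):

$ ipfs add --chunker=rabin-262144-524288-1048576 gentoo_root.img   # add with rabin (content-defined) chunking instead of the default fixed-size chunker
$ ipfs add --chunker=rabin-262144-524288-1048576 gentoo_root.img   # after editing the file, add it again with the same chunker settings so unchanged regions produce identical, deduplicated blocks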

However, you should also be aware that compression typically means a single byte change in an input file results in nearly every byte changing in the compressed output file. So a small change to a file that is compressed will typically leave no reusable blocks, regardless of how you chunk it.

This is why rsync cannot normally update *.gz files efficiently. However, gzip has an --rsyncable option that sacrifices a small amount of compression to minimize changes in the compressed output. Interestingly, it uses something very similar to rabin chunking, but I think it predates rabin. Using gzip --rsyncable for compressed files that are added to IPFS with rabin chunking will result in sharing blocks with other similarly compressed/added but slightly different files.
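A small sketch of that combination (assuming your gzip build includes --rsyncable, which not every build does; data.tar is just a placeholder name):

$ gzip --rsyncable -k data.tar                # --rsyncable keeps a small input change from rippling through the whole compressed stream; -k keeps the original
$ ipfs add --chunker=rabin data.tar.gz        # rabin chunking then maps the undisturbed compressed regions to identical, shareable blocks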

Upvotes: 2

Sandeep Yadav

Reputation: 46

Files in IPFS are content-addressed and immutable, so they can be complicated to edit. But there is MFS (Mutable File System), which can be used to treat files like you would in a normal name-based filesystem — you can add, remove, move, and edit MFS files and have all the work of updating links and hashes taken care of for you.
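A minimal sketch of that workflow (the /docs path, the bytes written, and the reuse of the question's hash are only illustrative):

$ ipfs files mkdir /docs
$ ipfs files cp /ipfs/QmZ33LSByGsKQS8YRW4yKjXLUam2cPP2V2g4PVPVwymY16 /docs/file.txt   # copy an existing immutable file into MFS under a mutable path
$ echo -n 'hello' | ipfs files write --offset 0 /docs/file.txt                        # overwrite a few bytes in place; links and hashes above the changed block are recomputed for you
$ ipfs files stat /docs/file.txt                                                      # inspect the new root hash of the edited file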

Upvotes: 0

Matthew Steeples

Reputation: 8058

IPFS isn't currently designed to support the scenario you're describing, because the files are indexed by a hash of their contents. There are circumstances where this will work "accidentally" though due to the way that files are broken down into chunks. If the change happens at the end of the file, then there is a possibility that the start of the file will have the same hashes for the "blocks" that are transferred.

It would be possible to analyse the data that is currently stored and see if you already have something that could be used for the block (which is similar to how rsync achieves this, although rsync uses a checksum algorithm designed for that process).
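A quick way to see whether that accidental block reuse happened (a sketch using the two hashes from the question's experiment):

$ ipfs object links QmdiETTY5fiwTkJeERbWAbPKtzcyjzMEJTJJosrqo2qKNm   # list the child blocks of the original file with their sizes
$ ipfs object links Qmew8yVjNzs2r54Ti6R64W9psxYFd16X3yNY28gZS4YeM3   # if the change is near the end, the leading block hashes should match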

Upvotes: 3
