Reputation: 11
I've been working on a Python implementation for a Merkle DAG (Directed Acyclic Graph) with the goal of creating Content Addressable Archive (CAR) files. However, I've hit a roadblock and I'm struggling to figure out the correct way to specify links in the nodes. Following is my python3 implementation.
I'm using the multiformat library to generate CIDs for each chunk of data, and then I'm trying to create a Merkle DAG where each node contains links to its children. The end goal is to produce a CAR file.
I'm storing the CIDs of the chunks in the "links" field of the root node. However, I'm unsure if this is done correctly. Are there any specific requirements for linking nodes in a IPLD Merkle DAG that I might be missing?
If anyone has experience with Merkle DAGs and CAR file creation in Python, could you please review my code and provide insights into the correct way to specify links in the nodes and generate a valid CAR file?
I appreciate any assistance or suggestions to help me move past this roadblock. Thank you!
from multiformats import CID, varint, multihash, multibase
import dag_cbor
import json
import msgpack
def generate_cid(data, codec="dag-pb"):
hash_value = multihash.digest(data, "sha2-256")
return CID("base32", version=1, codec=codec, digest=hash_value)
def generate_merkle_tree(file_path, chunk_size):
cids = []
# Read the file
with open(file_path, "rb") as file:
while True:
# Read a chunk of data
chunk = file.read(chunk_size)
if not chunk:
break
# Generate CID for the chunk
cid = generate_cid(chunk, codec="raw")
cids.append((cid, chunk))
# Generate Merkle tree root CID from all the chunks
# root_cid = generate_cid(b"".join(bytes(cid[0]) for cid in cids))
# Create the root node with links and other data
root_node = {
"file_name": "test.png",
"links": [str(cid[0]) for cid in cids]
}
# Encode the root node as dag-pb
root_data = dag_cbor.encode(root_node)
# Generate CID for the root node
root_cid = generate_cid(root_data, codec="dag-pb")
return root_cid, cids, root_data
def create_car_file(root, cids):
header_roots = [root]
header_data = dag_cbor.encode({"roots": header_roots, "version": 1})
header = varint.encode(len(header_data)) + header_data
car_content = b""
car_content += header
for cid, chunk in cids:
cid_bytes = bytes(cid)
block = varint.encode(len(chunk) + len(cid_bytes)) + cid_bytes + chunk
car_content += block
root_cid = bytes(root)
root_block = varint.encode(len(root_cid)) + root_cid
car_content += root_block
with open("output.car", "wb") as car_file:
car_file.write(car_content)
file_path = "./AADHAAR.png" # Replace with the path to your file
chunk_size = 16384 # Adjust the chunk size as needed
root, cids, root_data = generate_merkle_tree(file_path, chunk_size)
print(root)
create_car_file(root, cids)
I've been working on a Python implementation to create a Merkle DAG and subsequently generate a Content Addressable Archive (CAR) file.
I attempted to link nodes by storing the CIDs of the chunks in the "links" field of the root node. However, I'm uncertain if I'm doing this correctly. My expectation was that each node would contain links to its children, but I'm unsure if there are specific requirements for linking nodes in a IPLD Merkle DAG.
Upvotes: 1
Views: 134
Reputation: 555
Files and Directories are represented in IPFS using something we call UnixFS abstraction, which uses dag-pb
(protobuf) blocks for DAG metadata, and raw
blocks for the leaves at the bottom of the DAG (raw user data without envelopes.
There are key things to be aware, when you create UnixFS DAG from scratch:
If you have more than one block, the root should be dag-pb
.
dag-cbor
→ use dag-pb
insteadWhen you generate a dag-pb
block for UnixFS that is supposed to link to other UnixFS blocks, you need to follow dag-pb
serial format defined in https://ipld.io/specs/codecs/dag-pb/spec/ and implement message PBNode
. It includes links, but also has Data
field which requires additional handling.
Data
field inside of dag-pb
is another protobuf structure that you need to implement.Valid UnixFS root node has both links and byte ranges for children in blocksizes
:
When reading a file represented with
dag-pb
, theblocksizes
array gives us the size in bytes of the partial file content present in children DAGs. Each index inPBNode.Links
MUST have a corresponding chunk size stored at the same index indecode(PBNode.Data).blocksizes
.
message Data
protobuf
definition there, it should be enough to get you started.Also, if it helps, there is a bunch of fixtures in https://ipld.io/specs/codecs/dag-pb/fixtures/cross-codec/
Older DAGs, created with CIDv0 do not use raw
blocks for leaves, and have data wrapped in dag-pb
envelope as well. Make sure you support both leaf types.
Block limit and HAMT-sharding: Remember that there is a block limit around 2MiB mark in various IPGFFS transports such as bitswap over libp2p, which means you should limit the size of your blocks if you want them to be portable.
HAMTShard
UnixFS dag-pb node type).bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
is a big DAG, the /wiki
subdirectory has with milions of files. You can see how HAMT-abstraction looks like here. HAMT abstraction is hidden from end user, see ipfs ls bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze
.Upvotes: 1