Ritesh Saha
Ritesh Saha

Reputation: 11

I am trying to implement a python3 library to create IPFS-Merkle DAGs but I have been unable to figure out the correct way to specify links?

I've been working on a Python implementation for a Merkle DAG (Directed Acyclic Graph) with the goal of creating Content Addressable Archive (CAR) files. However, I've hit a roadblock and I'm struggling to figure out the correct way to specify links in the nodes. Following is my python3 implementation.

I'm using the multiformat library to generate CIDs for each chunk of data, and then I'm trying to create a Merkle DAG where each node contains links to its children. The end goal is to produce a CAR file.

I'm storing the CIDs of the chunks in the "links" field of the root node. However, I'm unsure if this is done correctly. Are there any specific requirements for linking nodes in a IPLD Merkle DAG that I might be missing?

If anyone has experience with Merkle DAGs and CAR file creation in Python, could you please review my code and provide insights into the correct way to specify links in the nodes and generate a valid CAR file?

I appreciate any assistance or suggestions to help me move past this roadblock. Thank you!

from multiformats import CID, varint, multihash, multibase
import dag_cbor
import json
import msgpack

def generate_cid(data, codec="dag-pb"):
    hash_value = multihash.digest(data, "sha2-256")
    return CID("base32", version=1, codec=codec, digest=hash_value)

def generate_merkle_tree(file_path, chunk_size):
    cids = []

    # Read the file
    with open(file_path, "rb") as file:
        while True:
            # Read a chunk of data
            chunk = file.read(chunk_size)
            if not chunk:
                break

            # Generate CID for the chunk
            cid = generate_cid(chunk, codec="raw")
            cids.append((cid, chunk))

    # Generate Merkle tree root CID from all the chunks
    # root_cid = generate_cid(b"".join(bytes(cid[0]) for cid in cids))
    
    # Create the root node with links and other data
    root_node = {
        "file_name": "test.png",
        "links": [str(cid[0]) for cid in cids]
    }
    
    # Encode the root node as dag-pb
    root_data = dag_cbor.encode(root_node)
    
    # Generate CID for the root node
    root_cid = generate_cid(root_data, codec="dag-pb")
    
    return root_cid, cids, root_data

def create_car_file(root, cids):
    header_roots = [root]
    header_data = dag_cbor.encode({"roots": header_roots, "version": 1})
    header = varint.encode(len(header_data)) + header_data

    car_content = b""
    car_content += header
    for cid, chunk in cids:
        cid_bytes = bytes(cid)
        block = varint.encode(len(chunk) + len(cid_bytes)) + cid_bytes + chunk
        car_content += block
    
    root_cid = bytes(root)
    root_block = varint.encode(len(root_cid)) + root_cid
    car_content += root_block
    with open("output.car", "wb") as car_file:
        car_file.write(car_content)

file_path = "./AADHAAR.png"  # Replace with the path to your file
chunk_size = 16384  # Adjust the chunk size as needed

root, cids, root_data = generate_merkle_tree(file_path, chunk_size)
print(root)
create_car_file(root, cids)

I've been working on a Python implementation to create a Merkle DAG and subsequently generate a Content Addressable Archive (CAR) file.

I attempted to link nodes by storing the CIDs of the chunks in the "links" field of the root node. However, I'm uncertain if I'm doing this correctly. My expectation was that each node would contain links to its children, but I'm unsure if there are specific requirements for linking nodes in a IPLD Merkle DAG.

Upvotes: 1

Views: 134

Answers (1)

lidel
lidel

Reputation: 555

Files and Directories are represented in IPFS using something we call UnixFS abstraction, which uses dag-pb (protobuf) blocks for DAG metadata, and raw blocks for the leaves at the bottom of the DAG (raw user data without envelopes.

There are key things to be aware, when you create UnixFS DAG from scratch:

  • If you have more than one block, the root should be dag-pb.

    • Your code sample encoded root block as dag-cbor → use dag-pb instead
  • When you generate a dag-pb block for UnixFS that is supposed to link to other UnixFS blocks, you need to follow dag-pb serial format defined in https://ipld.io/specs/codecs/dag-pb/spec/ and implement message PBNode. It includes links, but also has Data field which requires additional handling.

    • The Data field inside of dag-pb is another protobuf structure that you need to implement.
  • Valid UnixFS root node has both links and byte ranges for children in blocksizes:

    When reading a file represented with dag-pb, the blocksizes array gives us the size in bytes of the partial file content present in children DAGs. Each index in PBNode.Links MUST have a corresponding chunk size stored at the same index in decode(PBNode.Data).blocksizes.

  • Also, if it helps, there is a bunch of fixtures in https://ipld.io/specs/codecs/dag-pb/fixtures/cross-codec/

  • Older DAGs, created with CIDv0 do not use raw blocks for leaves, and have data wrapped in dag-pb envelope as well. Make sure you support both leaf types.

  • Block limit and HAMT-sharding: Remember that there is a block limit around 2MiB mark in various IPGFFS transports such as bitswap over libp2p, which means you should limit the size of your blocks if you want them to be portable.

    • If you have a big file/directory, and the root block with all links gets bigger than 2MiB, you need to implement HAMT-sharding (HAMTShard UnixFS dag-pb node type).
    • Example: bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze is a big DAG, the /wiki subdirectory has with milions of files. You can see how HAMT-abstraction looks like here. HAMT abstraction is hidden from end user, see ipfs ls bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze.

Upvotes: 1

Related Questions