Python equivalent of Ruby's Array#pack, how to pack unknown string length and bytes together

Question

I am working my way through the book "Building Git", which walks through building Git in Ruby. I decided to write my version in Python while still following along in the book.

The author uses Ruby's Array#pack to pack a Git tree object. Git stores the 40-character hexadecimal blob hash in binary form, which reduces it to 20 bytes. In the author's words:

Putting everything together, this generates a string for each entry consisting of the mode 100644, a space, the filename, a null byte, and then twenty bytes for the object ID. Ruby’s Array#pack supports many more data encodings and is very useful for generating binary representations of values. If you wanted to, you could implement all the maths for reading pairs of digits from the object ID and turning each pair into a single byte, but Array#pack is so convenient that I usually reach for that first.

He uses the following code to implement this:

def to_s
    entries = @entries.sort_by(&:name).map do |entry|
      ["#{ MODE } #{ entry.name }", entry.oid].pack(ENTRY_FORMAT)
    end

with ENTRY_FORMAT = "Z*H40" and MODE = "100644". Each entry is an instance of a class with :name and :oid attributes, holding the name of a file and the SHA-1 hash of its contents.

And the format "Z*H40" means the following:

Our usage here consists of two separate encoding instructions:

Z*: this encodes the first string, "#{ MODE } #{ entry.name }", as an arbitrary-length null-padded string, that is, it represents the string as-is with a null byte appended to the end

H40: this encodes a string of forty hexadecimal digits, entry.oid, by packing each pair of digits into a single byte as we saw in Section 2.3.3, “Trees on disk”
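Read literally, the two directives describe plain byte concatenation, so one possible Python rendering needs no format string at all. The following is a minimal sketch, not the book's code: the helper name pack_entry is hypothetical, and it assumes the name is encoded as UTF-8 and the object ID is always a 40-character hex string.

```python
def pack_entry(mode: str, name: str, oid: str) -> bytes:
    """Rough Python counterpart of [f"{mode} {name}", oid].pack("Z*H40").

    pack_entry is a hypothetical helper name, not part of any library.
    """
    if len(oid) != 40:
        raise ValueError("expected a 40-character hexadecimal object ID")
    # Z*: the string's bytes as-is, followed by a single null byte.
    header = f"{mode} {name}".encode("utf-8") + b"\x00"
    # H40: every pair of hex digits becomes one byte (40 digits -> 20 bytes).
    return header + bytes.fromhex(oid)
```

For a file named hello.txt this produces 17 header bytes (mode, space, name, null byte) followed by the 20 raw bytes of the object ID.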

I have tried for many hours to replicate this in Python using struct.pack and various other methods, but either I am not getting the format right or I am missing something very obvious. In any case, this is what I currently have:

def to_s(self):
      entries = sorted(self.entries, key=lambda x: x.name)

      entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
      packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)

      return packed_entries

but, as expected, this fails because it tries to concatenate a str with bytes:

Traceback (most recent call last):
  File "jit.py", line 67, in <module>
    database.store(tree)
  File "/home/maslin/jit/pyJit/database.py", line 12, in store
    string = obj.to_s()
  File "/home/maslin/jit/pyJit/tree.py", line 40, in to_s
    entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
  File "/home/maslin/jit/pyJit/tree.py", line 40, in <listcomp>
    entries = [f"{self.MODE} {entry.name}" + entry.oid.encode() for entry in entries]
TypeError: can only concatenate str (not "bytes") to str

So then I tried to keep everything as a string and let struct.pack do the formatting for me, but that raised struct.error: bad char in struct format.

def to_s(self):
      entries = sorted(self.entries, key=lambda x: x.name)

      entries = [f"{self.MODE} {entry.name}" + entry.oid for entry in entries]
      packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)

      return packed_entries

And the traceback:

Traceback (most recent call last):
  File "jit.py", line 67, in <module>
    database.store(tree)
  File "/home/maslin/jit/pyJit/database.py", line 12, in store
    string = obj.to_s()
  File "/home/maslin/jit/pyJit/tree.py", line 41, in to_s
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)
  File "/home/maslin/jit/pyJit/tree.py", line 41, in <genexpr>
    packed_entries = b"".join(pack("!Z*40s", entry) for entry in entries)
struct.error: bad char in struct format
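The error is consistent with Python's struct module having no Z or * format characters: its s code takes an explicit byte count, and struct.pack expects bytes arguments rather than str. If struct is still wanted, one option is to build the format string per entry. This is a sketch under those assumptions; pack_entry_struct is a hypothetical name, and UTF-8 is an assumed encoding for the filename.

```python
import struct


def pack_entry_struct(mode: str, name: str, oid: str) -> bytes:
    """Pack one tree entry with struct, computing the string length at runtime.

    pack_entry_struct is a hypothetical helper name used for illustration.
    """
    header = f"{mode} {name}".encode("utf-8")
    # "<n>s" pads short values with null bytes, so asking for one byte more
    # than the header's length appends the terminating null, like Ruby's Z*.
    fmt = f"!{len(header) + 1}s20s"
    # bytes.fromhex turns the 40 hex digits into the 20 raw bytes (Ruby's H40).
    return struct.pack(fmt, header, bytes.fromhex(oid))
```

Because the format has to be rebuilt for every entry, struct adds little here over concatenating the bytes directly.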

How can I pack a string for each entry consisting of the mode 100644, a space, the filename, a null byte, and then twenty bytes for the object ID?

The author notes above that this can be done by "implementing all the maths for reading pairs of digits from the object ID and turning each pair into a single byte", so if your solution involves this method, that is also ok.
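For reference, the "maths" the author mentions can be sketched in a few lines of Python. The function name hex_to_bytes is hypothetical; the built-in bytes.fromhex performs the same conversion, so this is only meant to show what happens to each pair of digits.

```python
def hex_to_bytes(oid: str) -> bytes:
    """Turn each pair of hex digits into one byte (hypothetical helper)."""
    if len(oid) % 2 != 0:
        raise ValueError("hex string must contain an even number of digits")
    # Slice the string into two-character chunks: "ce", "01", "36", ...
    pairs = (oid[i:i + 2] for i in range(0, len(oid), 2))
    # int(pair, 16) parses a chunk as a base-16 number in the range 0..255.
    return bytes(int(pair, 16) for pair in pairs)
```

A 40-digit object ID therefore yields exactly 20 bytes.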

P.S. this question did not help me nor did this.

P.P.S. ChatGPT was no help either.
