Memento mori.
Memento mori.

Reputation: 11

Is this hardware-level parallel multiplication algorithm correct?

I'm an undergraduate student studying a section in our computer architecture textbook that introduces a hardware-level parallel multiplication algorithm. As depicted in the figure, the algorithm involves a sequential peeling process during the addition phase, where bits are "peeled off" from both front and back in the order of 1, 1, 2, 4, 8, 16, going directly to the output from a higher level of the tree.

This suggests that we can accomplish 64-bit multiplication after log_2(64) iterations with just a 64-bit adder.

While I understand the "peeling off" process for the lower bits, I'm unsure if this process is equally valid for the higher bits. Should we take into account the effect of carry generation? This leads me to suspect that there might be some inaccuracies in this book's content.

image link

To illustrate my point, consider the multiplication of 4'b1111 and 4'b1000. The result should be 8'b01111000, but it seems impossible to peel off a '0' in the first step in this case. Could someone clarify this for me?

Upvotes: 1

Views: 54

Answers (1)

Peter Cordes
Peter Cordes

Reputation: 364180

The top two partial-products from 4'b1111 x 4'b1000 are
8'b01111000 and
8'b00000000 (left shifted copies of 1111 times 0 or 1).

Or if you do it the other way,
8'b01000000 and
8'b00100000

Neither of these carry into the high bit when you add them, and all the partial products have a leading 0, so that's not a counter-example for the first step.

The problem case for that diagram is that they're peeling off the high bit of the result before even the first step of adding, so it's always going to be 0? Or has no dependency on lower bits if they're looking at a lower bit. But 0b1111 * 0b1111 is 0b11100001 which has its high bit set.

So I agree with you, this doesn't seem to make sense.

It possibly could work if the first "peeled" bit was from after the first addition, since adding shifted copies of the same number might make carry-out from the very top impossible if it didn't happen in the first step. (But I'm not sure if that's actually true and haven't tried some test cases with e.g. 0b1011 x 0b1111, or other cases with only one zero between set bits.)


This suggests that we can accomplish 64-bit multiplication after log_2(64) iterations with just a 64-bit adder.

No, not just 6 uses of a single 64-bit adder! They're not saying you need fewer adders or add operations, they're using the associativity of integer addition to do (a+b) + (c+d) instead of sequential, so latency is lower for the same amount of work. So they still have 63x 64-bit adders, arranged in a tree instead of a single total += partial_product[i] dependency chain.

It is a full multiply so they are managing to produce a 128-bit result with only 63x 64-bit adders, though.


Fun fact: you can do even better than what they're showing by not fully propagating carry at every add step, instead using adders that take inputs with bit and carry, or something like that IIRC. The key being that each level of the tree only has O(1) latency, not O(log n) for an n-bit adder (using carry-lookahead or similar tricks.)

Upvotes: 0

Related Questions