Reputation: 11
I'm an undergraduate student studying a section in our computer architecture textbook that introduces a hardware-level parallel multiplication algorithm. As depicted in the figure, the algorithm involves a sequential peeling process during the addition phase, where bits are "peeled off" from both front and back in the order of 1, 1, 2, 4, 8, 16, going directly to the output from a higher level of the tree.
This suggests that we can accomplish 64-bit multiplication after log_2(64)
iterations with just a 64-bit adder.
While I understand the "peeling off" process for the lower bits, I'm unsure if this process is equally valid for the higher bits. Should we take into account the effect of carry generation? This leads me to suspect that there might be some inaccuracies in this book's content.
To illustrate my point, consider the multiplication of 4'b1111
and 4'b1000
. The result should be 8'b01111000
, but it seems impossible to peel off a '0' in the first step in this case. Could someone clarify this for me?
Upvotes: 1
Views: 54
Reputation: 364180
The top two partial-products from 4'b1111
x 4'b1000
are
8'b01111000
and
8'b00000000
(left shifted copies of 1111
times 0 or 1).
Or if you do it the other way,
8'b01000000
and
8'b00100000
Neither of these carry into the high bit when you add them, and all the partial products have a leading 0
, so that's not a counter-example for the first step.
The problem case for that diagram is that they're peeling off the high bit of the result before even the first step of adding, so it's always going to be 0? Or has no dependency on lower bits if they're looking at a lower bit. But 0b1111 * 0b1111
is 0b11100001
which has its high bit set.
So I agree with you, this doesn't seem to make sense.
It possibly could work if the first "peeled" bit was from after the first addition, since adding shifted copies of the same number might make carry-out from the very top impossible if it didn't happen in the first step. (But I'm not sure if that's actually true and haven't tried some test cases with e.g. 0b1011
x 0b1111
, or other cases with only one zero between set bits.)
This suggests that we can accomplish 64-bit multiplication after
log_2(64)
iterations with just a 64-bit adder.
No, not just 6 uses of a single 64-bit adder! They're not saying you need fewer adders or add operations, they're using the associativity of integer addition to do (a+b) + (c+d)
instead of sequential, so latency is lower for the same amount of work. So they still have 63x 64-bit adders, arranged in a tree instead of a single total += partial_product[i]
dependency chain.
It is a full multiply so they are managing to produce a 128-bit result with only 63x 64-bit adders, though.
Fun fact: you can do even better than what they're showing by not fully propagating carry at every add step, instead using adders that take inputs with bit and carry, or something like that IIRC. The key being that each level of the tree only has O(1)
latency, not O(log n)
for an n-bit adder (using carry-lookahead or similar tricks.)
Upvotes: 0