Reputation: 89
As the Fermi whitepaper describes, there are 16 SMs (Streaming Multiprocessors), each consisting of 32 cores. The GPU executes threads in groups of 32 threads, called warps.
First question: Am I right to assume that each warp could be treated as something like the vector width, meaning I could execute a single instruction on 32 pieces of data in parallel?
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can each differ?
If so, how many times per second can it execute on 512 data elements in parallel?
Upvotes: -1
Views: 59
Reputation: 152173
First question: Am I right to assume that each warp could be treated as something like the vector width, meaning I could execute a single instruction on 32 pieces of data in parallel?
Yes.
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can each differ?
Yes, possibly, depending on the operation type. A GPU SM includes functional units that handle different types of operations (instructions). An integer add may not be handled by the same functional unit as a floating-point add, for example. Because different operations are handled by different functional units, and because there is no requirement that the SM contain 32 functional units for every instruction type, the specific throughput depends on the instruction. However, the 32 functional units you are referring to can each handle a floating-point add, multiply, or multiply-add, so for those specific operation types your calculation is correct.
If so, how many times per second can it execute on 512 data elements in parallel?
This is given by the clock rate divided by the number of clocks needed to service an instruction. For example, with 32 FP add units, the GPU can theoretically retire one such instruction, covering 512 data elements, in a single clock cycle. If another operation, such as integer add, had only 16 functional units to service it, it would require 2 clocks to be serviced warp-wide, so we would divide the number by 2. And if you had a mix of operations, say 8 floating-point adds issued on 8 SMs and 8 integer adds issued on the other 8 SMs, you would have a more complex calculation.
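As a rough sketch of the reasoning above (the 32-unit and 16-unit counts for FP add and integer add are the illustrative numbers from this answer, not a guaranteed spec for every Fermi part):

```python
# Sketch: clocks needed to service one instruction warp-wide,
# given how many functional units of that type the SM has.
WARP_SIZE = 32

def clocks_per_warp_instruction(functional_units):
    """Clocks to issue one instruction for all 32 threads of a warp.

    Assumes the unit count evenly divides the warp size, as in the
    examples discussed above.
    """
    return WARP_SIZE // functional_units

# FP add with 32 units: serviced warp-wide in a single clock.
print(clocks_per_warp_instruction(32))  # -> 1
# Integer add with (illustratively) 16 units: takes 2 clocks warp-wide.
print(clocks_per_warp_instruction(16))  # -> 2
```

So an operation with half the functional units sustains half the warp-wide throughput, which is exactly the divide-by-2 in the text.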
The theoretical peak floating-point throughput is computed this way. For example, the Fermi M2090 has all 16 SMs enabled and is claimed to have a peak theoretical throughput of approximately 1333 GFLOPS for FP32 ops. That calculation is as follows:
16 SMs * 32 functional units/SM * 2 ops/functional unit/hotclock * 2 hotclocks/clock * 651M clocks/sec ≈ 1333 GFLOPS FP32
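The same arithmetic, written out (numbers taken directly from the line above; the FMA counting as 2 ops and the 2x "hotclock" are Fermi-specific details):

```python
# Peak FP32 throughput calculation for the Fermi M2090, per the formula above.
sms = 16                 # SMs enabled on the M2090
units_per_sm = 32        # FP32 functional units ("cores") per SM
ops_per_unit = 2         # a fused multiply-add counts as 2 FP ops
hotclocks_per_clk = 2    # Fermi ALUs run at twice the scheduler clock
clk_hz = 651e6           # 651M clocks/sec

peak_flops = sms * units_per_sm * ops_per_unit * hotclocks_per_clk * clk_hz
print(peak_flops / 1e9)  # -> 1333.248, i.e. ~1333 GFLOPS FP32
```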
Upvotes: 2