Reputation: 88
I attempted to write a sort algorithm to reorder instructions for a dual issue processor (Cell SPU). One way to obtain dual issue processing an instruction should not depend on the instruction that precedes it (another involves separate pipelines, but I'm focused on instructions in the same pipeline). I understand this would be too much for the compiler to do and I didn't find what I needed when searching. This can be done by hand in most cases but a sorting algorithm should ensure the lowest "sequence count" (number or dependent instructions that follow each-other).
My question is has this or something similar been done before? Is there an optimized approach?
Simple example pseudo-code halving instruction time (inputs: i1, i2, i3
):
v1 = i1 ^ i2; - #single-issued
v2 = v1 | i2; \ #v2,v3 dual-issued
v3 = i1 & i3; / #v2,v3 dual-issued
v4 = v3 & i2; - #single-issued
can be written as:
v1 = i1 ^ i2; \ #v1,v3 dual-issued
v3 = i1 & i3; / #v1,v3 dual-issued
v2 = v1 | i2; \ #v2,v4 dual-issued
v4 = v3 & i2; / #v2,v4 dual-issued
Here is a python implementation I created that recursively reorders the instructions to achieve the lowest "sequence count".
reorder.py
http://pastebin.com/dt8eWy3H
sample t8-1.h
http://pastebin.com/w0DYg8ff
Upvotes: 1
Views: 296
Reputation: 88
I ended up using the 'Assembly Visualizer' asmVis.jar
java program to view the sections of assembly that could be optimized, and I reordered the instructions manually. I increased the speed of the assembly function a great deal using both the odd and even side during almost every instruction cycle (dual issued instructions).
TODO: add github link to source
Upvotes: 1
Reputation: 3987
While I can't speak specifically for the Cell, code scheduling is ABSOLUTELY something the compiler should be doing for you.
Compilers will re-order instructions, pad in NOPS as required, and do everything it can to provide a good code schedule for you. Normally, I'd tell you to look at the "mtune" parameters for your compiler (they allow you to tell your compiler exactly what your processor looks like), but since you're coding for the Cell it should already know what to do (but do check the compiler manual just to be sure).
A brief glance at the GCC compiler for the SPU here shows options such as:
-mdual-nops=n
By default, GCC inserts nops to increase dual issue when
it expects it to increase performance. n can be a value from
0 to 10. A smaller n inserts fewer nops. 10 is the default, 0
is the same as -mno-dual-nops. Disabled with -Os. `
As a programmer, it is your job to provide enough "ILP" in your code to get good scheduling. Try to avoid branches, avoid putting long latency operations on the critical path, etc., and you should be fine. Analyze the objdump of your critical loops to verify the code is being scheduled as you desire. The compiler is very smart, but it may require a little coaxing.
Upvotes: 1