Reputation: 8632
I have no experience with SIMD, but I have a method that is too slow. I now get 40 fps, and I need more. Does anyone know how I could make this paint method faster? Perhaps SIMD instructions are a solution?
The sourceData is currently a byte[] (videoBytes), but it could be a pointer too.
```csharp
public bool PaintFrame(IntPtr layerBuffer, ushort vStart, byte vScale)
{
    for (ushort y = 0; y < height; y++)
    {
        ushort eff_y = (ushort)(vScale * (y - vStart) / 128);
        var newY = tileHeight > 0 ? eff_y % tileHeight : 0;
        uint y_add = (uint)(newY * tileWidth * bitsPerPixel >> 3);
        for (int x = 0; x < width; x++)
        {
            var newX = tileWidth > 0 ? x % tileWidth : 0;
            ushort x_add = (ushort)(newX * bitsPerPixel >> 3);
            uint tile_offset = y_add + x_add;
            byte color = videoBytes[tile_offset];
            var colorIndex = BitsPerPxlCalculation(color, newX);
            // Apply Palette Offset
            if (paletteOffset > 0)
                colorIndex += paletteOffset;
            var place = x + eff_y * width;
            Marshal.WriteByte(layerBuffer + place, colorIndex);
        }
    }
    return true;
}
```
```csharp
private void UpdateBitPerPixelMethod()
{
    // Convert tile byte to indexed color
    switch (bitsPerPixel)
    {
        case 1:
            // 8 pixels per byte, first pixel in the most significant bit
            // (same bit order as the 2bpp and 4bpp cases)
            BitsPerPxlCalculation = (color, newX) => (byte)(color >> 7 - (newX & 7) & 1);
            break;
        case 2:
            BitsPerPxlCalculation = (color, newX) => (byte)(color >> 6 - ((newX & 3) << 1) & 3);
            break;
        case 4:
            BitsPerPxlCalculation = (color, newX) => (byte)(color >> 4 - ((newX & 1) << 2) & 0xf);
            break;
        case 8:
            BitsPerPxlCalculation = (color, newX) => color;
            break;
    }
}
```
More info
Depending on the settings, the bpp can change. The indexed colors and the palette colors are stored separately. Here I have to recreate the image's pixel indexes, so that later I can use the palette and the color indexes in WPF (Windows) or SDL (Linux, Mac) to display the image.
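To see what the 2bpp and 4bpp expressions produce, they can be run on a single packed byte. In this sketch the two lambdas are copied verbatim from UpdateBitPerPixelMethod(); the PixelUnpack class and its UnpackByte helper are hypothetical and exist only for illustration. The first pixel comes from the most significant bits.

```csharp
using System;

// Hypothetical wrapper around the 2bpp and 4bpp lambdas from the question.
static class PixelUnpack
{
    public static readonly Func<byte, int, byte> TwoBpp =
        (color, newX) => (byte)(color >> 6 - ((newX & 3) << 1) & 3);

    public static readonly Func<byte, int, byte> FourBpp =
        (color, newX) => (byte)(color >> 4 - ((newX & 1) << 2) & 0xf);

    // Unpacks every pixel stored in one byte (4 pixels at 2bpp, 2 pixels at 4bpp).
    public static byte[] UnpackByte(byte packed, int bitsPerPixel)
    {
        Func<byte, int, byte> unpack;
        if (bitsPerPixel == 2) unpack = TwoBpp;
        else if (bitsPerPixel == 4) unpack = FourBpp;
        else throw new ArgumentOutOfRangeException(nameof(bitsPerPixel));

        int count = 8 / bitsPerPixel;
        var result = new byte[count];
        for (int i = 0; i < count; i++)
            result[i] = unpack(packed, i);
        return result;
    }
}
```

For example, UnpackByte(0b1110_0100, 2) returns {3, 2, 1, 0}, and UnpackByte(0xA5, 4) returns {10, 5}.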
vStart lets me crop the top of the image.
UpdateBitPerPixelMethod() is only called before a frame is rendered, never during it. While the loops run, no settings can change.
So I was hoping that some parts could be written with SIMD, because the procedure is the same for every pixel.
Upvotes: 0
Views: 188
Reputation: 396
Hi,
Your code is not entirely clear to me. Are you trying to create a new matrix/image? If so, create a new 2D allocation and calculate the entire image into it. Clear it once you no longer need the results.
Replace Marshal.WriteByte(layerBuffer + place, colorIndex); with writes into that 2D image (maybe this is the image?).
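A minimal sketch of that idea, assuming the layer buffer is laid out as consecutive rows of exactly width bytes (as the question's x + eff_y * width indexing suggests): each row is built in a managed array that is allocated once and then copied with a single Marshal.Copy call. RowWriter and its members are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: fill a reusable managed row, then copy it to the
// unmanaged layer buffer in one call instead of one Marshal.WriteByte per pixel.
sealed class RowWriter
{
    private readonly byte[] _row;
    private readonly int _width;

    public RowWriter(int width)
    {
        _width = width;
        _row = new byte[width]; // allocated once, reused for every row and frame
    }

    // The row being built; write color indexes here.
    public byte[] Row => _row;

    // Copies the current row to layerBuffer at row index y (stride == width assumed).
    public void Flush(IntPtr layerBuffer, int y)
    {
        IntPtr dest = IntPtr.Add(layerBuffer, y * _width);
        Marshal.Copy(_row, 0, dest, _width);
    }
}
```

Inside the pixel loop, the write would become writer.Row[x] = colorIndex, with one writer.Flush(layerBuffer, targetRow) after the inner loop, so each row costs one bulk copy instead of width separate calls.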
Regarding the rest, the problem is that you have non-uniform offsets and jumps in the indexing. That makes developing a SIMD solution difficult (you need masking and the like). My bet would be to calculate everything for all the indices and save it into individual 2D matrices that are allocated once at the beginning.
For example:
```csharp
ushort eff_y = (ushort)(vScale * (y - vStart) / 128);
```
This is calculated for every image row. You could instead calculate it once into an array, since I do not believe the image dimensions change during the run.
I don't know whether vStart and vScale are constants set at program start. You should do this for every calculation that depends only on constants, and then just read the precomputed arrays in the render loop.
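The same idea applies to the per-column values. newX and x_add depend only on x, tileWidth and bitsPerPixel, so they can be computed once per settings change instead of once per pixel per frame. A sketch, where ColumnTables is a hypothetical name and the formulas are the question's:

```csharp
using System;

// Hypothetical per-column lookup tables, rebuilt only when width, tileWidth
// or bitsPerPixel change.
sealed class ColumnTables
{
    public readonly int[] NewX;    // x wrapped into the tile (question's newX)
    public readonly ushort[] XAdd; // byte offset of that pixel inside the tile row

    public ColumnTables(int width, int tileWidth, int bitsPerPixel)
    {
        NewX = new int[width];
        XAdd = new ushort[width];
        for (int x = 0; x < width; x++)
        {
            int newX = tileWidth > 0 ? x % tileWidth : 0; // same rule as PaintFrame
            NewX[x] = newX;
            XAdd[x] = (ushort)(newX * bitsPerPixel >> 3);
        }
    }
}
```

In the pixel loop, videoBytes[y_add + tables.XAdd[x]] and BitsPerPxlCalculation(color, tables.NewX[x]) then replace the modulo, the conditional and the shift that are currently evaluated for every pixel.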
SIMD can help, but only if every iteration computes the same thing and you avoid branches and switch statements.
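As an illustration of a loop that does fit SIMD, the palette offset can be added to a whole array of color indexes with System.Numerics.Vector&lt;byte&gt;, which applies the same addition to Vector&lt;byte&gt;.Count bytes per step with no branch. This is a sketch under the assumption that a row's color indexes have already been collected into a byte[]; PaletteSimd and AddOffset are hypothetical names.

```csharp
using System;
using System.Numerics;

// Hypothetical branch-free palette offset: same operation on every element.
static class PaletteSimd
{
    public static void AddOffset(byte[] indexes, byte paletteOffset)
    {
        var offset = new Vector<byte>(paletteOffset); // broadcast to all lanes
        int width = Vector<byte>.Count;
        int i = 0;

        // Vectorized part: width bytes per iteration, byte addition wraps mod 256.
        for (; i <= indexes.Length - width; i += width)
        {
            var v = new Vector<byte>(indexes, i);
            (v + offset).CopyTo(indexes, i);
        }

        // Scalar tail for the bytes that do not fill a whole vector.
        for (; i < indexes.Length; i++)
            indexes[i] = (byte)(indexes[i] + paletteOffset);
    }
}
```

Because adding 0 leaves every byte unchanged, this can run unconditionally, so no if (paletteOffset > 0) check is needed inside the loop.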
Addition 1
From my standpoint, you have multiple problems and design considerations. First of all, you need to let go of the idea that SIMD is going to help in your case. You would need to remove all conditional statements: SIMD instructions apply one operation to every lane at once and cannot branch per element.
Your approach should be to split the logic into manageable pieces so you can see which piece of the code takes the most time. One big problem is Marshal.WriteByte: every call writes exactly one byte, which tells the compiler that you handle one byte at a time. I'm guessing this creates one big bottleneck.
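One way to do that split is to time each stage separately. This hypothetical helper runs a stage once as a warm-up (so JIT compilation is not included) and then reports the average over many runs using System.Diagnostics.Stopwatch. A profiler or a benchmarking library will give more reliable numbers; this is only a quick first comparison.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper: average wall-clock time of one stage of the frame.
static class StageTimer
{
    public static double AverageMilliseconds(Action stage, int iterations)
    {
        if (iterations <= 0)
            throw new ArgumentOutOfRangeException(nameof(iterations));

        stage(); // warm-up run, not measured

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            stage();
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

Timing, for example, a loop that only decodes color indexes against a loop that only performs the Marshal.WriteByte calls shows which part dominates the frame time.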
From reading the code, I see that you perform checks inside every loop iteration. This must be restructured.
Assuming the image rarely gets cropped, the per-row calculations can be separated from the per-pixel image calculations.
```csharp
// Per-row values; arrays instead of lists because the size (height) is known
ushort[] eff_y = new ushort[height];
uint[] y_add = new uint[height];
for (ushort y = 0; y < height; y++)
{
    eff_y[y] = (ushort)(vScale * (y - vStart) / 128);
    var newY = tileHeight > 0 ? eff_y[y] % tileHeight : 0;
    y_add[y] = (uint)(newY * tileWidth * bitsPerPixel >> 3);
}
```
So these values can be precalculated, and they only need to change when the cropping changes.
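A sketch of recomputing only when the cropping changes, assuming height, tileWidth, tileHeight and bitsPerPixel stay fixed for the lifetime of the object: the cache remembers the last vStart and vScale and rebuilds the eff_y and y_add tables only when either differs. RowTableCache, Update and Rebuilds are hypothetical names; the formulas are the ones from the question.

```csharp
using System;

// Hypothetical cache for the per-row tables (eff_y and y_add).
sealed class RowTableCache
{
    private readonly int _height, _tileWidth, _tileHeight, _bitsPerPixel;
    private ushort? _lastVStart; // null until the first Update call
    private byte _lastVScale;

    public ushort[] EffY { get; }
    public uint[] YAdd { get; }
    public int Rebuilds { get; private set; } // how often the tables were recomputed

    public RowTableCache(int height, int tileWidth, int tileHeight, int bitsPerPixel)
    {
        _height = height;
        _tileWidth = tileWidth;
        _tileHeight = tileHeight;
        _bitsPerPixel = bitsPerPixel;
        EffY = new ushort[height];
        YAdd = new uint[height];
    }

    // Call once per frame; it only does work when the crop settings changed.
    public void Update(ushort vStart, byte vScale)
    {
        if (_lastVStart == vStart && _lastVScale == vScale)
            return;

        _lastVStart = vStart;
        _lastVScale = vScale;
        for (int y = 0; y < _height; y++)
        {
            EffY[y] = (ushort)(vScale * (y - vStart) / 128);
            int newY = _tileHeight > 0 ? EffY[y] % _tileHeight : 0;
            YAdd[y] = (uint)(newY * _tileWidth * _bitsPerPixel >> 3);
        }
        Rebuilds++;
    }
}
```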
Now it gets really tricky.
paletteOffset: adding 0 changes nothing, so the if statement only makes sense if paletteOffset can be negative. In that case, clamp it to zero once before the loop and remove the if statement.
bitsPerPixel: this looks like a fixed value for the duration of the rendering, so remove UpdateBitPerPixelMethod and pass it in as a parameter.
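A sketch of passing bitsPerPixel in and switching once per row instead of calling the BitsPerPxlCalculation delegate for every pixel. RowDecoder and DecodeRow are hypothetical. The 2bpp and 4bpp shifts match the question's lambdas, while the 1bpp case assumes eight pixels per byte with the first pixel in the most significant bit. For brevity it also assumes a row does not wrap around inside a tile (tileWidth >= width).

```csharp
using System;

// Hypothetical row decoder: one switch per row, then a tight loop per format.
static class RowDecoder
{
    // Decodes width pixels starting at byte rowStart of src into dst[0..width).
    public static void DecodeRow(byte[] src, int rowStart, byte[] dst, int width, int bitsPerPixel)
    {
        switch (bitsPerPixel)
        {
            case 8: // one pixel per byte: plain copy
                Buffer.BlockCopy(src, rowStart, dst, 0, width);
                break;
            case 4: // two pixels per byte, high nibble first
                for (int x = 0; x < width; x++)
                    dst[x] = (byte)((src[rowStart + (x >> 1)] >> (4 - ((x & 1) << 2))) & 0xF);
                break;
            case 2: // four pixels per byte, highest bit pair first
                for (int x = 0; x < width; x++)
                    dst[x] = (byte)((src[rowStart + (x >> 2)] >> (6 - ((x & 3) << 1))) & 0x3);
                break;
            case 1: // eight pixels per byte, most significant bit first (assumption)
                for (int x = 0; x < width; x++)
                    dst[x] = (byte)((src[rowStart + (x >> 3)] >> (7 - (x & 7))) & 0x1);
                break;
            default:
                throw new ArgumentOutOfRangeException(nameof(bitsPerPixel));
        }
    }
}
```

Each case has a fixed shift pattern and no per-pixel delegate call, which is the kind of loop the JIT can optimize more easily; the palette offset can then be applied to the whole decoded row in one pass.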
```csharp
for (ushort y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        var newX = tileWidth > 0 ? x % tileWidth : 0; // conditional statement
        ushort x_add = (ushort)(newX * bitsPerPixel >> 3);
        uint tile_offset = y_add[y] + x_add;           // precomputed per row
        byte color = videoBytes[tile_offset];
        var colorIndex = BitsPerPxlCalculation(color, newX);
        // Apply Palette Offset
        if (paletteOffset > 0) // conditional statement
            colorIndex += paletteOffset;
        var place = x + eff_y[y] * width;              // precomputed per row
        Marshal.WriteByte(layerBuffer + place, colorIndex);
    }
}
```
These are only a few things that need to be done before you try anything with SIMD. By that time, the changes will also give the compiler hints about what you want to do, which could improve the generated machine code. You also need to measure the performance of your code to pinpoint the bottleneck; it is very hard to guess correctly just by reading the code.
Good luck
Upvotes: 3