Halide: casting RGB images and parallelising blur

Question

The following code is adapted from the Halide tutorials.

Func blurX(Func continuation)
{ Var x("x"), y("y"), c("c");
  Func input_16("input_16");
  input_16(x, y, c) = cast<uint16_t>(continuation(x, y, c));
  Func blur_x("blur_x");
  blur_x(x, y, c) = (input_16(x-1, y, c) +
                     2 * input_16(x, y, c) +
                     input_16(x+1, y, c)) / 4;
  Func output("outputBlurX");
  output(x, y, c) = cast<uint8_t>(blur_x(x, y, c));
  return output;
}

int main()
{ Var x("x"), y("y"), c("c");
  Image<uint8_t> input = load_image("input.png");
  Func clamped("clamped");
  clamped = BoundaryConditions::repeat_edge(input);
  Func img1Fun("img1Fun");
  Func img2Fun = blurX(clamped);
  Func outputFun("outputFun");
  /* carry on */
}

I've three questions:

Casting: Does cast<uint16_t>(clamped(x, y, c)) cast the 8-bit R, G and B values at every (x, y) position to 16-bit integers, so that what the cast returns is still an RGB image that can be indexed, e.g. img1Fun(x, y, 0) to access its R value? Or does it convert every RGB pixel to a luminance value in [0..1] at every (x, y) position, i.e. r*0.3 + g*0.59 + b*0.11?
Overloading: In the RGB blur, are arithmetic operations on (x, y, c) overloaded across all channel indexes? E.g.

(input_16(x-1, y, c) + 2 * input_16(x, y, c) + input_16(x+1, y, c)) / 4;

Is this an overloading of:

(input_16(x-1, y, 0) + 2 * input_16(x, y, 0) + input_16(x+1, y, 0)) / 4;
(input_16(x-1, y, 1) + 2 * input_16(x, y, 1) + input_16(x+1, y, 1)) / 4;
(input_16(x-1, y, 2) + 2 * input_16(x, y, 2) + input_16(x+1, y, 2)) / 4;

Parallelising: How could I parallelise blurX? Based on the brighten.cpp example from CVPR'15, I could use blur_x.vectorize(x, 4).parallel(y); to vectorise row-wise in the X direction and parallelise across threads in the Y direction, like this?

Func blurX(Func continuation)
{ Var x("x"), y("y"), c("c");
  Func input_16("input_16");
  input_16(x, y, c) = cast<uint16_t>(continuation(x, y, c));
  Func blur_x("blur_x");
  blur_x(x, y, c) = (input_16(x-1, y, c) +
                     2 * input_16(x, y, c) +
                     input_16(x+1, y, c)) / 4;
  blur_x.vectorize(x, 4).parallel(y);
  Func output("outputBlurX");
  output(x, y, c) = cast<uint8_t>(blur_x(x, y, c));
  return output;
}

Zalman Stern · Accepted Answer

Question 1: A Func defines an abstract mapping from a set of coordinates to an Expr, which is a mathematical function of those coordinates. In general, operators are straightforward and do not have any imaging-specific behavior such as converting a color tuple to a luminosity scalar. (To accomplish such a conversion, one must write the code explicitly, as the coefficients depend on the color space used.)

Hence the statement:

img1Fun(x, y, c) = cast<uint16_t>(clamped(x, y, c));

defines img1Fun as having the same number of channels as clamped but a 16-bit type instead of an 8-bit type. Arithmetic in Halide stays in the same bit width as its largest operand and, unlike C, is not implicitly promoted to a standard int size. This is because with vectorization it is important to maintain explicit control over the lane size. In this case, using a 16-bit intermediate type is required to avoid overflow when summing 8-bit values.
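The overflow can be seen in a small plain C++ stand-in (not Halide code). Because C++ promotes 8-bit operands to int, the first function truncates every intermediate result back to uint8_t by hand to imitate arithmetic that stays 8 bits wide. The function names blur3_8bit and blur3_16bit are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Imitates arithmetic that stays 8 bits wide: every intermediate result is
// truncated back to uint8_t, so anything above 255 wraps around.
inline uint8_t blur3_8bit(uint8_t left, uint8_t centre, uint8_t right) {
    uint8_t doubled = static_cast<uint8_t>(2 * centre);
    uint8_t sum = static_cast<uint8_t>(left + doubled);
    sum = static_cast<uint8_t>(sum + right);
    return static_cast<uint8_t>(sum / 4);
}

// Widens to 16 bits first (the role of input_16), then narrows after the
// division (the role of the final cast back to 8 bits).
inline uint8_t blur3_16bit(uint8_t left, uint8_t centre, uint8_t right) {
    uint16_t l = left, c = centre, r = right;
    uint16_t sum = static_cast<uint16_t>(l + 2 * c + r);
    assert(sum <= 1020);  // 255 + 2 * 255 + 255, comfortably inside 16 bits
    return static_cast<uint8_t>(sum / 4);  // at most 255, so it fits in 8 bits
}
```

For the neighbouring values 200, 250, 200 the 8-bit version returns 33 because the sum wraps around, while the 16-bit version returns the expected 225.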

There is a corresponding cast back to an 8-bit type after the division. The blurred result is guaranteed to fit in an 8-bit type as the calculation is normalized (the average value of a given color channel taken over the entire image should not change). The code above does both the upcast and the downcast in two places, which is redundant. It likely has no performance impact, as the compiler should be smart enough to recognize that the outer set of casts are no-ops, but it does not make for particularly readable code.

Question 2: Effectively the same answer. I would not use the term "overloading" here, but the definition applies across all coordinates. The Var "c" is mentioned on the left-hand side and the right-hand side and has the same value on each. (We have a shorthand underscore ('_') notation to mean "zero or more coordinates" to allow passing through an argument list, but otherwise there is nothing special in these definitions.)
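One way to picture this is as the loop nest the definition implies: the same expression is evaluated for every (x, y, c), and c is simply passed through unchanged. The following is a plain C++ sketch, not what Halide actually generates. The PlanarImage struct and its at() accessor are hypothetical, and the clamping only imitates what BoundaryConditions::repeat_edge does in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical planar image: all of channel 0, then all of channel 1, etc.
struct PlanarImage {
    int width, height, channels;
    std::vector<uint8_t> data;

    // Reads a pixel, clamping x and y to the image (repeat-edge behaviour).
    uint8_t at(int x, int y, int c) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return data[(static_cast<std::size_t>(c) * height + y) * width + x];
    }
};

// One definition, evaluated at every coordinate; no per-channel special case.
inline PlanarImage blur_x(const PlanarImage& in) {
    assert(in.data.size() ==
           static_cast<std::size_t>(in.width) * in.height * in.channels);
    PlanarImage out{in.width, in.height, in.channels,
                    std::vector<uint8_t>(in.data.size())};
    for (int c = 0; c < in.channels; ++c) {
        for (int y = 0; y < in.height; ++y) {
            for (int x = 0; x < in.width; ++x) {
                // 16-bit intermediate, as in the question's input_16.
                uint16_t sum = static_cast<uint16_t>(
                    in.at(x - 1, y, c) + 2 * in.at(x, y, c) + in.at(x + 1, y, c));
                std::size_t i =
                    (static_cast<std::size_t>(c) * in.height + y) * in.width + x;
                out.data[i] = static_cast<uint8_t>(sum / 4);
            }
        }
    }
    return out;
}
```

Calling blur_x on a 3x1 image with two channels blurs each channel's row independently; values from one channel never enter the sum for another.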

Question 3: The easiest way to schedule this for vectorization and parallelization is to use a planar layout (all the R values stored next to each other, then all the G, etc.) and to vectorize to the appropriate size for 16-bit math. (E.g. "vectorize(x, natural_vector_size<uint16_t>())" if working inside a Generator.) Then add thread parallelism along rows with ".parallel(y)". Depending on the length of the rows, you may want to add a split parameter to the parallel directive.

This schedule will also work with a semi-planar representation (a row of R, a row of G, and a row of B).
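The reason both layouts suit vectorizing along x is that stepping x by one moves to the adjacent element in memory, so a vector of consecutive x values is one contiguous load. A small plain C++ sketch of the address arithmetic makes this concrete. The three helper functions are hypothetical and only illustrate the layouts named above, with an interleaved (RGBRGB...) layout added for contrast.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (x, y, c) in an image of width W, height H, C channels.

// Planar: all of channel 0, then all of channel 1, and so on.
inline std::size_t planar_index(int x, int y, int c, int W, int H) {
    assert(x >= 0 && x < W && y >= 0 && y < H && c >= 0);
    return (static_cast<std::size_t>(c) * H + y) * W + x;
}

// Semi-planar: for each y, a row of R, then a row of G, then a row of B.
inline std::size_t semi_planar_index(int x, int y, int c, int W, int C) {
    assert(x >= 0 && x < W && y >= 0 && c >= 0 && c < C);
    return (static_cast<std::size_t>(y) * C + c) * W + x;
}

// Interleaved: R, G, B of one pixel stored next to each other.
inline std::size_t interleaved_index(int x, int y, int c, int W, int C) {
    assert(x >= 0 && x < W && y >= 0 && c >= 0 && c < C);
    return (static_cast<std::size_t>(y) * W + x) * C + c;
}
```

In the planar and semi-planar cases the offset grows by 1 per step in x, whereas in the interleaved case it grows by C, so neighbouring x values of one channel are no longer adjacent.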

There are other approaches which might make more sense when blurX is used in the context of an actual pipeline or if a non-planar storage layout is required.
