Reputation: 1153
The three lines of code that add the mask (shape [height, width, 1]) to the R, G, and B channels (each also [height, width, 1]) drag the runtime of this code from under a second to 5-10 minutes.
Is there a better way to add two numpy matrices? I know the addition is the culprit, because when I take it out the code runs significantly faster again. Any insight into why this is so slow?
This is an RGB color Mat. It is broken up into small regions called superpixels, which are just groupings of pixels. I'm trying to take the average of all the pixels in each grouping and fill that grouping with that single color. On the first run this worked perfectly, finishing the picture in under a second, but all blacks were lost. To fix this, I decided to add 1 wherever the superpixel mask is set but the pixel value is still zero, so that black pixels also count toward the average.
import cv2 as cv
import os
import numpy as np

img = cv.imread(path + item)
f, e = os.path.splitext(path + item)
for x in range(0, 3):
    img = cv.pyrDown(img)
height, width, channel = img.shape
img_super = cv.ximgproc.createSuperpixelSLIC(img, cv.ximgproc.MSLIC, 100, 10.0)
img_super.iterate(3)
labels = np.zeros((height, width), np.int32)
labels = img_super.getLabels(labels)
super_pixelized = np.zeros_like(img)
print("Check-1")
for x in range(0, img_super.getNumberOfSuperpixels()):
    new_img = img.copy()
    #print(new_img.shape)
    mask = cv.inRange(labels, x, x)
    new_img = cv.bitwise_and(img, new_img, mask=mask)
    r, g, b = np.dsplit(new_img, 3)
    print("Check-2")
    basis = np.expand_dims(mask, 1)
    basis = basis * 1/255
    print(basis)
    r = np.add(r, basis)
    g = np.add(g, basis)
    b = np.add(b, basis)
    r_avg = int(np.sum(r)/np.count_nonzero(r))
    g_avg = int(np.sum(g)/np.count_nonzero(g))
    b_avg = int(np.sum(b)/np.count_nonzero(b))
    #print(r_avg)
    #print(g_avg)
    #print(b_avg)
    r = mask * r_avg
    g = mask * g_avg
    b = mask * b_avg
    final_img = np.dstack((r, g, b))
    super_pixelized = cv.bitwise_or(final_img, super_pixelized)
This simple adding procedure caused the code runtime to increase significantly.
Upvotes: 1
Views: 601
Reputation: 19071
The specific issue that slows down your program lies in the call to np.expand_dims(...):
basis = np.expand_dims(mask, 1)
The second parameter is the "position in the expanded axes where the new axis is placed." Since mask has 2 axes at that point, you insert a new axis between the first and the second.
For example:
>>> import numpy as np
>>> mask = np.zeros((240, 320), np.uint8)
>>> mask.shape
(240L, 320L)
>>> expanded = np.expand_dims(mask, 1)
>>> expanded.shape
(240L, 1L, 320L)
We get an array of shape (240L, 1L, 320L), where we really want (240L, 320L, 1L).
Later on, you add this mis-shaped array to each of the split channel images, which have a shape of (240L, 320L, 1L).
>>> img = np.zeros((240,320,3), np.uint8)
>>> r, g, b = np.dsplit(img, 3)
>>> r.shape
(240L, 320L, 1L)
>>> r = np.add(r, expanded)
>>> r.shape
(240L, 320L, 320L)
Due to how the numpy broadcasting rules work, you end up with a 320-channel image (instead of a 1-channel one).
That is an orders-of-magnitude larger number of values to process in the subsequent steps, hence the drastic slow-down.
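To put a number on it, using the example shapes above:
>>> 240 * 320 * 320  # values per channel after the bad broadcast
24576000
>>> 240 * 320 * 1    # values per channel with the intended shape
76800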
The fix is simple, just add the axis in the right place:
basis = np.expand_dims(mask, 2)
That will fix the slow-down; however, there are many more issues to solve and potential optimizations to make.
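As an aside, there are a couple of equivalent spellings of the same fix (a matter of preference; any of them produces the desired trailing axis):
>>> basis = mask[:, :, np.newaxis]           # same as np.expand_dims(mask, 2)
>>> basis = mask.reshape(mask.shape + (1,))  # ditto
>>> basis.shape
(240L, 320L, 1L)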
Since we're interested in the performance of the code that processes the labels, let's refactor a bit, and make a simple test harness containing all the common bits, along with means to time individual steps and report the timings:
File superpix_harness.py
import cv2
import time

def run_test(superpix_size, fn, file_name_in, reduce_count, file_name_out):
    times = []
    times.append(time.time())
    img = cv2.imread(file_name_in)
    for _ in range(0, reduce_count):
        img = cv2.pyrDown(img)
    times.append(time.time())
    img_super = cv2.ximgproc.createSuperpixelSLIC(img, cv2.ximgproc.MSLIC, superpix_size, 10.0)
    img_super.iterate(3)
    labels = img_super.getLabels()
    superpixel_count = img_super.getNumberOfSuperpixels()
    times.append(time.time())
    super_pixelized = fn(img, labels, superpixel_count)
    times.append(time.time())
    cv2.imwrite(file_name_out, super_pixelized)
    times.append(time.time())
    return (reduce_count, img.shape, superpix_size, superpixel_count, times)

def print_header():
    print "Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total"

def print_report(test_result):
    reduction, shape, sp_size, sp_count, times = test_result
    print "%d , %d , %d , %d , %d" % (reduction, shape[1], shape[0], sp_size, sp_count),
    for i in range(len(times) - 1):
        print (", %0.4f" % (times[i+1] - times[i])),
    print ", %0.4f" % (times[-1] - times[0])

def measure_fn(fn):
    print_header()
    for reduction in [3, 2, 1, 0]:
        for sp_size in [100, 50, 25, 12]:
            print_report(run_test(sp_size, fn, 'barrack.jpg', reduction, 'output_%01d_%03d.jpg' % (reduction, sp_size)))
By chance, this turned out to be the first large-enough image I came upon to test this with (barrack.jpg):
Ok, so, let's refactor your processing code into a standalone function and clean it up a bit in the process.
First of all, note that since we're in Python, we're not talking about a Mat, but rather a numpy.ndarray. Another thing to keep in mind is that OpenCV uses the BGR colour format by default, so the variables should be renamed accordingly.
The initial copy of the input image you make with new_img = img.copy() is useless, since you overwrite it soon enough. Let's drop that and just do new_img = cv.bitwise_and(img, img, mask=mask).
Now, we'll need to look at what led you into this conundrum in the first place. After masking out the label-specific area, you calculate the average intensity as
b_avg = int(np.sum(b) / np.count_nonzero(b))
You identified the problem correctly -- while the count of non-zero pixels correctly discounts anything that doesn't belong to the current label, it also discounts any zero-valued pixels that do belong to it (and thus throws off the resulting mean).
There's a much simpler fix than what you've tried -- simply divide by the count of non-zero pixels in the mask (and reuse this count in all 3 calculations).
Finally, we can take advantage of numpy indexing to write the per-channel mean colour only to the masked pixels (e.g. b[mask != 0] = b_avg).
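A tiny illustration of that kind of masked assignment:
>>> a = np.array([0, 5, 7], np.uint8)
>>> m = np.array([0, 255, 255], np.uint8)
>>> a[m != 0] = 6
>>> a
array([0, 6, 6], dtype=uint8)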
File op_labels.py
import cv2
import numpy as np

def super_pixelize(img, labels, superpixel_count):
    result = np.zeros_like(img)
    for x in range(superpixel_count):
        mask = cv2.inRange(labels, x, x)
        new_img = cv2.bitwise_and(img, img, mask=mask)
        b, g, r = np.dsplit(new_img, 3)  # OpenCV images are BGR
        label_pixel_count = np.count_nonzero(mask)
        b_avg = np.uint8(np.sum(b) / label_pixel_count)
        g_avg = np.uint8(np.sum(g) / label_pixel_count)
        r_avg = np.uint8(np.sum(r) / label_pixel_count)
        b[mask != 0] = b_avg
        g[mask != 0] = g_avg
        r[mask != 0] = r_avg
        final_img = np.dstack((b, g, r))
        result = cv2.bitwise_or(final_img, result)
    return result
Now we can measure the performance of our code.
Benchmark script:
from superpix_harness import *
import op_labels
measure_fn(op_labels.super_pixelize)
Timings:
Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 , 420 , 336 , 100 , 155 , 0.1490 , 0.0590 , 0.3990 , 0.0070 , 0.6140
3 , 420 , 336 , 50 , 568 , 0.1490 , 0.0670 , 1.4510 , 0.0070 , 1.6740
3 , 420 , 336 , 25 , 1415 , 0.1480 , 0.0720 , 3.6580 , 0.0080 , 3.8860
3 , 420 , 336 , 12 , 3009 , 0.1490 , 0.0860 , 7.7170 , 0.0070 , 7.9590
2 , 839 , 672 , 100 , 617 , 0.1460 , 0.3570 , 7.1140 , 0.0150 , 7.6320
2 , 839 , 672 , 50 , 1732 , 0.1460 , 0.3590 , 20.0610 , 0.0150 , 20.5810
2 , 839 , 672 , 25 , 3556 , 0.1520 , 0.3860 , 40.8780 , 0.0160 , 41.4320
2 , 839 , 672 , 12 , 6627 , 0.1460 , 0.3990 , 76.1310 , 0.0160 , 76.6920
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 2.2480 , 88.3880 , 0.0460 , 90.8250
1 , 1678 , 1344 , 50 , 4519 , 0.1430 , 2.2440 , 221.7200 , 0.0580 , 224.1650
1 , 1678 , 1344 , 25 , 9083 , 0.1530 , 2.2100 , 442.7040 , 0.0480 , 445.1150
1 , 1678 , 1344 , 12 , 17869 , 0.1440 , 2.2620 , 849.9970 , 0.0500 , 852.4530
0 , 3356 , 2687 , 100 , 4786 , 0.1300 , 10.9440 , 916.8950 , 0.1570 , 928.1260
0 , 3356 , 2687 , 50 , 11942 , 0.1280 , 10.7100 , 2284.5040 , 0.1680 , 2295.5100
0 , 3356 , 2687 , 25 , 29066 , 0.1300 , 10.7440 , 5561.0440 , 0.1690 , 5572.0870
0 , 3356 , 2687 , 12 , 59634 , 0.1250 , 10.9860 , 11409.9540 , 0.1770 , 11421.2420
While this no longer has the issue of extremely long run-times at small image sizes (and relatively small label counts), it's quite obvious that it scales poorly. We should be able to do better than that.
First of all, we can avoid the need to split the image into single-channel images, process those, and then reassemble them back into BGR format. Fortunately, OpenCV provides the function cv2.mean, which will calculate the per-channel mean of an (optionally masked) image.
Another useful optimization is pre-allocating and reusing the mask array in subsequent iterations (cv2.inRange has an optional argument that lets you give it an output array to reuse). Allocations (and de-allocations) can get quite costly, so the fewer you do, the better.
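A quick sketch of both points (a hypothetical mini-session; the shapes and values are arbitrary):
>>> import cv2
>>> import numpy as np
>>> img = np.zeros((4, 4, 3), np.uint8)
>>> img[:] = (10, 20, 30)
>>> cv2.mean(img)  # per-channel means; always 4 values, even for a 3-channel image
(10.0, 20.0, 30.0, 0.0)
>>> labels = np.zeros((4, 4), np.int32)
>>> mask = np.zeros((4, 4), np.uint8)          # allocated once...
>>> out = cv2.inRange(labels, 0, 0, dst=mask)  # ...and reused; no new allocation
>>> out is mask
True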
The most important observation to make at this point is that the size of each superpixel (region with a specific label) is generally much smaller than the whole image. Instead of processing the whole image for each label, we should limit the majority of the work to the region of interest (ROI) -- the minimal rectangle that fits the pixels belonging to the specific label.
To determine the ROI, we can use cv2.boundingRect on the mask.
File improved_labels.py
import cv2
import numpy as np

def super_pixelize(img, labels, superpixel_count):
    result = np.zeros_like(img)
    mask = np.zeros(img.shape[:2], np.uint8)  # Here it seems to make more sense to pre-alloc and reuse
    for label in range(0, superpixel_count):
        cv2.inRange(labels, label, label, dst=mask)
        # Find the bounding box of this label
        x, y, w, h = cv2.boundingRect(mask)
        # Work only on the rectangular region containing the label
        mask_roi = mask[y:y+h, x:x+w]
        img_roi = img[y:y+h, x:x+w]
        # Per-channel mean of the masked pixels (we're using BGR, so we drop the useless 4th channel it gives us)
        roi_mean = cv2.mean(img_roi, mask_roi)[:3]
        # Set all masked pixels in the ROI of the target image to the mean colour
        result[y:y+h, x:x+w][mask_roi != 0] = roi_mean
    return result
Benchmark script:
from superpix_harness import *
import improved_labels
measure_fn(improved_labels.super_pixelize)
Timings:
Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 , 420 , 336 , 100 , 155 , 0.1500 , 0.0600 , 0.0250 , 0.0070 , 0.2420
3 , 420 , 336 , 50 , 568 , 0.1490 , 0.0670 , 0.0760 , 0.0070 , 0.2990
3 , 420 , 336 , 25 , 1415 , 0.1480 , 0.0740 , 0.1740 , 0.0070 , 0.4030
3 , 420 , 336 , 12 , 3009 , 0.1480 , 0.0860 , 0.3560 , 0.0070 , 0.5970
2 , 839 , 672 , 100 , 617 , 0.1510 , 0.3720 , 0.2450 , 0.0150 , 0.7830
2 , 839 , 672 , 50 , 1732 , 0.1480 , 0.3610 , 0.6450 , 0.0170 , 1.1710
2 , 839 , 672 , 25 , 3556 , 0.1480 , 0.3730 , 1.2930 , 0.0160 , 1.8300
2 , 839 , 672 , 12 , 6627 , 0.1480 , 0.4160 , 2.3840 , 0.0160 , 2.9640
1 , 1678 , 1344 , 100 , 1854 , 0.1420 , 2.2140 , 2.8510 , 0.0460 , 5.2530
1 , 1678 , 1344 , 50 , 4519 , 0.1480 , 2.2280 , 6.7440 , 0.0470 , 9.1670
1 , 1678 , 1344 , 25 , 9083 , 0.1430 , 2.1920 , 13.5850 , 0.0480 , 15.9680
1 , 1678 , 1344 , 12 , 17869 , 0.1440 , 2.2960 , 26.3940 , 0.0490 , 28.8830
0 , 3356 , 2687 , 100 , 4786 , 0.1250 , 10.9570 , 30.8380 , 0.1570 , 42.0770
0 , 3356 , 2687 , 50 , 11942 , 0.1310 , 10.7930 , 76.1670 , 0.1710 , 87.2620
0 , 3356 , 2687 , 25 , 29066 , 0.1250 , 10.7480 , 184.0220 , 0.1720 , 195.0670
0 , 3356 , 2687 , 12 , 59634 , 0.1240 , 11.0440 , 377.8910 , 0.1790 , 389.2380
This is a lot better (at the very least, we finish), although it still gets very expensive with large images/superpixel counts.
There are still options to do better, but we will have to think out of the box.
Large images and large superpixel counts still perform poorly. This is mainly because, for each superpixel, we need to process the whole labels array to determine a mask, and then process the entire mask to determine the ROI. Superpixels are rarely rectangular, so there is further work wasted on the pixels in the ROI that don't belong to the current label (even if it's just testing the mask).
Let's recall that each location in the input can belong to only a single superpixel (label). For each label, we need to calculate the mean R, G, and B intensity of all the pixels that belong to it (i.e. first determine the sum of each channel and count the pixels, then calculate the means). We should be able to do this in a single pass over the input image and the associated labels array. Once we have calculated the mean colour of each label, we can use it as a lookup table in a second pass over the labels array, populating the output image with the appropriate colours.
In Python, we can implement this algorithm in the following manner:
Note: Though rather verbose, this is the best-performing version -- why, I cannot exactly explain, but it quite closely corresponds to the best-performing Cython function.
File fast_labels_python.py
import numpy as np

def super_pixelize(img, labels, superpixel_count):
    rows = img.shape[0]
    cols = img.shape[1]
    assert img.shape[0] == labels.shape[0]
    assert img.shape[1] == labels.shape[1]
    assert img.shape[2] == 3

    sums = np.zeros((superpixel_count, 3), dtype=np.int64)
    counts = np.zeros((superpixel_count, 1), dtype=np.int64)

    # Pass 1: accumulate per-label channel sums and pixel counts
    for r in range(rows):
        for c in range(cols):
            label = labels[r, c]
            sums[label, 0] = (sums[label, 0] + img[r, c, 0])
            sums[label, 1] = (sums[label, 1] + img[r, c, 1])
            sums[label, 2] = (sums[label, 2] + img[r, c, 2])
            counts[label, 0] = (counts[label, 0] + 1)

    label_colors = np.uint8(sums / counts)

    # Pass 2: look up the mean colour of each pixel's label
    result = np.zeros_like(img)
    for r in range(rows):
        for c in range(cols):
            label = labels[r, c]
            result[r, c, 0] = label_colors[label, 0]
            result[r, c, 1] = label_colors[label, 1]
            result[r, c, 2] = label_colors[label, 2]
    return result
Benchmark script:
from superpix_harness import *
import fast_labels_python
measure_fn(fast_labels_python.super_pixelize)
Timings:
Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 , 420 , 336 , 100 , 155 , 0.1530 , 0.0590 , 0.5160 , 0.0070 , 0.7350
3 , 420 , 336 , 50 , 568 , 0.1470 , 0.0680 , 0.5250 , 0.0070 , 0.7470
3 , 420 , 336 , 25 , 1415 , 0.1480 , 0.0740 , 0.5140 , 0.0070 , 0.7430
3 , 420 , 336 , 12 , 3009 , 0.1490 , 0.0870 , 0.5190 , 0.0070 , 0.7620
2 , 839 , 672 , 100 , 617 , 0.1480 , 0.3770 , 2.0720 , 0.0150 , 2.6120
2 , 839 , 672 , 50 , 1732 , 0.1490 , 0.3680 , 2.0480 , 0.0160 , 2.5810
2 , 839 , 672 , 25 , 3556 , 0.1470 , 0.3730 , 2.0570 , 0.0150 , 2.5920
2 , 839 , 672 , 12 , 6627 , 0.1460 , 0.4140 , 2.0530 , 0.0170 , 2.6300
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 2.2080 , 8.2970 , 0.0470 , 10.6950
1 , 1678 , 1344 , 50 , 4519 , 0.1430 , 2.2240 , 8.2970 , 0.0480 , 10.7120
1 , 1678 , 1344 , 25 , 9083 , 0.1430 , 2.2020 , 8.2280 , 0.0490 , 10.6220
1 , 1678 , 1344 , 12 , 17869 , 0.1430 , 2.3010 , 8.3210 , 0.0520 , 10.8170
0 , 3356 , 2687 , 100 , 4786 , 0.1270 , 10.8630 , 33.0230 , 0.1580 , 44.1710
0 , 3356 , 2687 , 50 , 11942 , 0.1270 , 10.7950 , 32.9230 , 0.1660 , 44.0110
0 , 3356 , 2687 , 25 , 29066 , 0.1260 , 10.7270 , 33.3660 , 0.1790 , 44.3980
0 , 3356 , 2687 , 12 , 59634 , 0.1270 , 11.0840 , 33.1850 , 0.1790 , 44.5750
Being a pure Python implementation, this really suffers from interpreter overhead. At small image sizes and low superpixel counts, that overhead dominates. However, we can see that for a given image size, the performance remains quite consistent irrespective of the number of superpixels. This is a good sign. A further good sign is that at large enough image sizes and superpixel counts, our more efficient algorithm starts to win.
To do any better, we'll have to avoid the interpreter overhead -- that means producing some code we can compile into a binary Python module which will perform the whole operation in one Python interpreter call.
Cython provides means to translate (annotated) Python code to C, and compile the result as a binary Python module. This can yield tremendous performance improvements when done right. Cython also includes support for numpy arrays that we can take advantage of.
NB: You will need to grok the Cython tutorials and documentation and do some experimentation to figure out how to annotate things to get the best performance (as I have done) -- detailed explanation is far beyond the scope of this (already excessive) answer.
File fast_labels_cython.pyx
# cython: infer_types=True
import numpy as np
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def super_pixelize(unsigned char[:, :, :] img, int[:, :] labels, int superpixel_count):
    cdef Py_ssize_t rows = img.shape[0]
    cdef Py_ssize_t cols = img.shape[1]
    assert img.shape[0] == labels.shape[0]
    assert img.shape[1] == labels.shape[1]
    assert img.shape[2] == 3

    sums = np.zeros((superpixel_count, 3), dtype=np.int64)
    cdef long long[:, ::1] sums_view = sums
    counts = np.zeros((superpixel_count, 1), dtype=np.int64)
    cdef long long[:, ::1] counts_view = counts

    cdef Py_ssize_t r, c
    cdef int label

    for r in range(rows):
        for c in range(cols):
            label = labels[r, c]
            sums_view[label, 0] = (sums_view[label, 0] + img[r, c, 0])
            sums_view[label, 1] = (sums_view[label, 1] + img[r, c, 1])
            sums_view[label, 2] = (sums_view[label, 2] + img[r, c, 2])
            counts_view[label, 0] = (counts_view[label, 0] + 1)

    label_colors = np.uint8(sums / counts)
    cdef unsigned char[:, ::1] label_colors_view = label_colors

    result = np.zeros_like(img)
    cdef unsigned char[:, :, ::1] result_view = result
    for r in range(rows):
        for c in range(cols):
            label = labels[r, c]
            result_view[r, c, 0] = label_colors_view[label, 0]
            result_view[r, c, 1] = label_colors_view[label, 1]
            result_view[r, c, 2] = label_colors_view[label, 2]
    return result
Compilation:
cythonize.exe -2 -i fast_labels_cython.pyx
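Alternatively (my own sketch, not part of the original workflow), a minimal setup.py can drive the same build, assuming a standard Cython install:
# setup.py -- minimal build sketch; assumes Cython is installed
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fast_labels_cython.pyx"))
Built with: python setup.py build_ext --inplace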
Benchmark script:
from superpix_harness import *
import fast_labels_cython
measure_fn(fast_labels_cython.super_pixelize)
Timings:
Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 , 420 , 336 , 100 , 155 , 0.1550 , 0.0600 , 0.0010 , 0.0080 , 0.2240
3 , 420 , 336 , 50 , 568 , 0.1500 , 0.0680 , 0.0010 , 0.0070 , 0.2260
3 , 420 , 336 , 25 , 1415 , 0.1480 , 0.0750 , 0.0010 , 0.0070 , 0.2310
3 , 420 , 336 , 12 , 3009 , 0.1490 , 0.0880 , 0.0010 , 0.0070 , 0.2450
2 , 839 , 672 , 100 , 617 , 0.1480 , 0.3580 , 0.0040 , 0.0150 , 0.5250
2 , 839 , 672 , 50 , 1732 , 0.1480 , 0.3680 , 0.0050 , 0.0150 , 0.5360
2 , 839 , 672 , 25 , 3556 , 0.1480 , 0.3780 , 0.0040 , 0.0170 , 0.5470
2 , 839 , 672 , 12 , 6627 , 0.1470 , 0.4080 , 0.0040 , 0.0170 , 0.5760
1 , 1678 , 1344 , 100 , 1854 , 0.1440 , 2.2340 , 0.0170 , 0.0450 , 2.4400
1 , 1678 , 1344 , 50 , 4519 , 0.1430 , 2.2450 , 0.0170 , 0.0480 , 2.4530
1 , 1678 , 1344 , 25 , 9083 , 0.1440 , 2.2290 , 0.0170 , 0.0480 , 2.4380
1 , 1678 , 1344 , 12 , 17869 , 0.1460 , 2.3310 , 0.0180 , 0.0500 , 2.5450
0 , 3356 , 2687 , 100 , 4786 , 0.1290 , 11.0840 , 0.0690 , 0.1560 , 11.4380
0 , 3356 , 2687 , 50 , 11942 , 0.1330 , 10.7650 , 0.0680 , 0.1680 , 11.1340
0 , 3356 , 2687 , 25 , 29066 , 0.1310 , 10.8120 , 0.0770 , 0.1710 , 11.1910
0 , 3356 , 2687 , 12 , 59634 , 0.1310 , 11.1200 , 0.0790 , 0.1770 , 11.5070
Even with the largest image and almost 60 thousand superpixels, the processing time is under a tenth of a second (compared to a little over 3 hours for the original variant).
Another option is to implement the algorithm directly in a lower-level language. Due to my familiarity with it, I implemented a binary Python module in C++ using Boost.Python. That library also has support for numpy arrays, so the work was mostly validating the input arguments and then porting the algorithm to use raw pointers.
File fast_labels.cpp
#define BOOST_ALL_NO_LIB
#include <boost/python.hpp>
#include <boost/python/numpy.hpp>

#include <cstdint>
#include <stdexcept>
#include <vector>

namespace bp = boost::python;

bp::numpy::ndarray super_pixelize(bp::numpy::ndarray const& image
    , bp::numpy::ndarray const& labels
    , int32_t label_count)
{
    if (image.get_dtype() != bp::numpy::dtype::get_builtin<uint8_t>()) {
        throw std::runtime_error("Invalid image dtype.");
    }
    if (image.get_nd() != 3) {
        throw std::runtime_error("Image must be a 3d ndarray.");
    }
    if (image.shape(2) != 3) {
        throw std::runtime_error("Image must have 3 channels.");
    }
    if (labels.get_dtype() != bp::numpy::dtype::get_builtin<int32_t>()) {
        throw std::runtime_error("Invalid label dtype.");
    }
    if (!((labels.get_nd() == 2) || ((labels.get_nd() == 3) && (labels.shape(2) == 1)))) {
        throw std::runtime_error("Labels must have 1 channel.");
    }
    if ((image.shape(0) != labels.shape(0)) || (image.shape(1) != labels.shape(1))) {
        throw std::runtime_error("Image and labels have incompatible shapes.");
    }
    if (label_count < 1) {
        throw std::runtime_error("Must have at least 1 label.");
    }

    bp::numpy::ndarray result(bp::numpy::zeros(image.get_nd(), image.get_shape(), image.get_dtype()));

    int32_t const ROWS(image.shape(0));
    int32_t const COLUMNS(image.shape(1));
    int32_t const ROW_STRIDE_IMAGE(image.strides(0));
    int32_t const COLUMN_STRIDE_IMAGE(image.strides(1));
    int32_t const ROW_STRIDE_LABELS(labels.strides(0));
    int32_t const COLUMN_STRIDE_LABELS(labels.strides(1));
    int32_t const ROW_STRIDE_RESULT(result.strides(0));
    int32_t const COLUMN_STRIDE_RESULT(result.strides(1));

    struct label_info
    {
        int64_t sum_b = 0;
        int64_t sum_g = 0;
        int64_t sum_r = 0;
        int64_t count = 0;
    };
    struct pixel_type
    {
        uint8_t b;
        uint8_t g;
        uint8_t r;
    };

    // Step 1: Collect data for each label
    std::vector<label_info> info(label_count);
    {
        char const* labels_row_ptr(labels.get_data());
        char const* image_row_ptr(image.get_data());
        for (int32_t row(0); row < ROWS; ++row) {
            char const* labels_col_ptr(labels_row_ptr);
            char const* image_col_ptr(image_row_ptr);
            for (int32_t col(0); col < COLUMNS; ++col) {
                int32_t label(*reinterpret_cast<int32_t const*>(labels_col_ptr));
                label_info& current_info(info[label]);
                pixel_type const& pixel(*reinterpret_cast<pixel_type const*>(image_col_ptr));
                current_info.sum_b += pixel.b;
                current_info.sum_g += pixel.g;
                current_info.sum_r += pixel.r;
                ++current_info.count;
                labels_col_ptr += COLUMN_STRIDE_LABELS;
                image_col_ptr += COLUMN_STRIDE_IMAGE;
            }
            labels_row_ptr += ROW_STRIDE_LABELS;
            image_row_ptr += ROW_STRIDE_IMAGE;
        }
    }

    // Step 2: Calculate mean color for each label
    std::vector<pixel_type> label_color(label_count);
    for (int32_t label(0); label < label_count; ++label) {
        label_info& current_info(info[label]);
        pixel_type& current_color(label_color[label]);
        current_color.b = current_info.sum_b / current_info.count;
        current_color.g = current_info.sum_g / current_info.count;
        current_color.r = current_info.sum_r / current_info.count;
    }

    // Step 3: Generate result
    {
        char const* labels_row_ptr(labels.get_data());
        char* result_row_ptr(result.get_data());
        for (int32_t row(0); row < ROWS; ++row) {
            char const* labels_col_ptr(labels_row_ptr);
            char* result_col_ptr(result_row_ptr);
            for (int32_t col(0); col < COLUMNS; ++col) {
                int32_t label(*reinterpret_cast<int32_t const*>(labels_col_ptr));
                pixel_type const& current_color(label_color[label]);
                pixel_type& pixel(*reinterpret_cast<pixel_type*>(result_col_ptr));
                pixel.b = current_color.b;
                pixel.g = current_color.g;
                pixel.r = current_color.r;
                labels_col_ptr += COLUMN_STRIDE_LABELS;
                result_col_ptr += COLUMN_STRIDE_RESULT;
            }
            labels_row_ptr += ROW_STRIDE_LABELS;
            result_row_ptr += ROW_STRIDE_RESULT;
        }
    }
    return result;
}

BOOST_PYTHON_MODULE(fast_labels)
{
    bp::numpy::initialize();
    bp::def("super_pixelize", super_pixelize);
}
Compilation:
Beyond the scope of this answer. I used CMake to build a DLL and then renamed it to .pyd for Python to recognise it.
Benchmark script:
from superpix_harness import *
import fast_labels
measure_fn(fast_labels.super_pixelize)
Timings:
Reduction, Width, Height, SP Size, SP Count, Time Load, Time SP, Time Process, Time Save, Time Total
3 , 420 , 336 , 100 , 155 , 0.1480 , 0.0580 , 0.0010 , 0.0070 , 0.2140
3 , 420 , 336 , 50 , 568 , 0.1490 , 0.0690 , 0.0010 , 0.0070 , 0.2260
3 , 420 , 336 , 25 , 1415 , 0.1510 , 0.0820 , 0.0010 , 0.0070 , 0.2410
3 , 420 , 336 , 12 , 3009 , 0.1510 , 0.0970 , 0.0010 , 0.0070 , 0.2560
2 , 839 , 672 , 100 , 617 , 0.1490 , 0.3750 , 0.0030 , 0.0150 , 0.5420
2 , 839 , 672 , 50 , 1732 , 0.1480 , 0.7540 , 0.0020 , 0.0160 , 0.9200
2 , 839 , 672 , 25 , 3556 , 0.1490 , 0.7070 , 0.0030 , 0.0160 , 0.8750
2 , 839 , 672 , 12 , 6627 , 0.1590 , 0.7300 , 0.0030 , 0.0160 , 0.9080
1 , 1678 , 1344 , 100 , 1854 , 0.1430 , 3.7120 , 0.0100 , 0.0450 , 3.9100
1 , 1678 , 1344 , 50 , 4519 , 0.1430 , 2.2510 , 0.0090 , 0.0510 , 2.4540
1 , 1678 , 1344 , 25 , 9083 , 0.1430 , 2.2080 , 0.0100 , 0.0480 , 2.4090
1 , 1678 , 1344 , 12 , 17869 , 0.1680 , 2.4280 , 0.0100 , 0.0500 , 2.6560
0 , 3356 , 2687 , 100 , 4786 , 0.1270 , 10.9230 , 0.0380 , 0.1580 , 11.2460
0 , 3356 , 2687 , 50 , 11942 , 0.1300 , 10.8860 , 0.0390 , 0.1640 , 11.2190
0 , 3356 , 2687 , 25 , 29066 , 0.1270 , 10.8080 , 0.0410 , 0.1800 , 11.1560
0 , 3356 , 2687 , 12 , 59634 , 0.1280 , 11.1280 , 0.0410 , 0.1800 , 11.4770
Slightly better still, although since we're already more than 2 orders of magnitude faster than the code that determines the superpixel labels, there's no need to go any further. With the largest image and the smallest superpixel size, we've improved by more than 6 orders of magnitude.
Upvotes: 2