Reputation: 109
I have a function I will call 'rgb2something' that transforms RGB data [1x1x3] into a single value (a probability), looping over each pixel in the input RGB data turns out to be rather slow.
I have tried the following approach to speed the conversion up. To generate the LUT (Look up table):
import numpy as np
levels = 256
levels2 = levels**2
lut = [0] * (levels ** 3)
levels_range = range(0, levels)
for r in levels_range:
for g in levels_range:
for b in levels_range:
lut[r + (g * levels) + (b * levels2)] = rgb2something(r, g, b)
And to convert RGB to the transformed probability image:
result = np.take(lut, r_channel + (g_channel * 256) + (b_channel * 65536))
However both generating the LUT and calculating the result is still slow. In 2 dimensions it's rather fast, however in 3 dimensions (r, g and b) it's slow. How can I increase the performance of this?
rgb2something(r, g, b)
looks like this:
def rgb2something(r, g, b):
y = np.array([[r, g, b]])
y_mean = np.mean(y, axis=0)
y_centered = y - y_mean
y_cov = y_centered.T.dot(y_centered) / len(y_centered)
m = len(Consts.x)
n = len(y)
q = m + n
pool_cov = (m / q * x_cov) + (n / q * y_cov)
inv_pool_cov = np.linalg.inv(pool_cov)
g = Consts.x_mean - y_mean
mah = g.T.dot(inv_pool_cov).dot(g) ** 0.5
return mah
Full working code sample of what I'm trying to achieve, I am using OpenCV so any OpenCV approaches such as Apply LUT are welcome, as are C/C++ approaches:
import matplotlib.pyplot as plt
import numpy as np
import cv2
class Model:
x = np.array([
[6, 5, 2],
[2, 5, 7],
[6, 3, 1]
])
x_mean = np.mean(x, axis=0)
x_centered = x - x_mean
x_covariance = x_centered.T.dot(x_centered) / len(x_centered)
m = len(x)
n = 1 # Only ever comparing to a single pixel
q = m + n
pooled_covariance = (m / q * x_covariance) # + (n / q * y_cov) -< Always 0 for a single point
inverse_pooled_covariance = np.linalg.inv(pooled_covariance)
def rgb2something(r, g, b):
#Calculates Mahalanobis Distance between pixel and model X
y = np.array([[r, g, b]])
y_mean = np.mean(y, axis=0)
g = Model.x_mean - y_mean
mah = g.T.dot(Model.inverse_pooled_covariance).dot(g) ** 0.5
return mah
def generate_lut():
levels = 256
levels2 = levels**2
lut = [0] * (levels ** 3)
levels_range = range(0, levels)
for r in levels_range:
for g in levels_range:
for b in levels_range:
lut[r + (g * levels) + (b * levels2)] = rgb2something(r, g, b)
return lut
def calculate_distance(lut, input_image):
return np.take(lut, input_image[:, :, 0] + (input_image[:, :, 1] * 256) + (input_image[:, :, 2] * 65536))
lut = generate_lut()
rgb = np.random.randint(255, size=(1080, 1920, 3), dtype=np.uint8)
result = calculate_distance(lut, rgb)
cv2.imshow("Example", rgb)
cv2.imshow("Result", result)
cv2.waitKey(0)
Upvotes: 2
Views: 4365
Reputation: 53089
There are several straightforward and very effective optimizations:
(1) vectorize, vectorize! It is not so difficult to vectorize essentially everything in this code. See below.
(2) use proper lookup, i.e fancy indexing, not np.take
(3) use Cholesky decomp. With blas dtrmm
we can exploit its triangular structure
And here is the code. Just add it to the end of OP's code (under EDIT 2). Unless you are very patient you probably also want to comment out the lut = generate_lut()
and result = calculate_distance(lut, rgb)
lines and all references to cv2. I've also added a random row to x
to make its covariance matrix non singular.
class Full_Model(Model):
ch = np.linalg.cholesky(Model.inverse_pooled_covariance)
chx = Model.x_mean@ch
def rgb2something_vectorized(rgb):
return np.sqrt(np.sum(((rgb - Full_Model.x_mean)@Full_Model.ch)**2, axis=-1))
from scipy.linalg import blas
def rgb2something_blas(rgb):
*shp, nchan = rgb.shape
return np.sqrt(np.einsum('...i,...i', *2*(blas.dtrmm(1, Full_Model.ch.T, rgb.reshape(-1, nchan).T, 0, 0, 0, 0, 0).T - Full_Model.chx,))).reshape(shp)
def generate_lut_vectorized():
return rgb2something_vectorized(np.transpose(np.indices((256, 256, 256))))
def generate_lut_blas():
rng = np.arange(256)
arr = np.empty((256, 256, 256, 3))
arr[0, ..., 0] = rng
arr[0, ..., 1] = rng[:, None]
arr[1:, ...] = arr[0]
arr[..., 2] = rng[:, None, None]
return rgb2something_blas(arr)
def calculate_distance_vectorized(lut, input_image):
return lut[input_image[..., 2], input_image[..., 1], input_image[..., 0]]
# test code
def random_check_lut(lut):
"""Because the original lut generator is excruciatingly slow,
we only compare a random sample, using the original code
"""
levels = 256
levels2 = levels**2
lut = lut.ravel()
levels_range = range(0, levels)
for r, g, b in np.random.randint(0, 256, (1000, 3)):
assert np.isclose(lut[r + (g * levels) + (b * levels2)], rgb2something(r, g, b))
import time
td = []
td.append((time.time(), 'create lut vectorized'))
lutv = generate_lut_vectorized()
td.append((time.time(), 'create lut using blas'))
lutb = generate_lut_blas()
td.append((time.time(), 'lookup using np.take'))
res = calculate_distance(lutv, rgb)
td.append((time.time(), 'process on the fly (no lookup)'))
resotf = rgb2something_vectorized(rgb)
td.append((time.time(), 'process on the fly (blas)'))
resbla = rgb2something_blas(rgb)
td.append((time.time(), 'lookup using fancy indexing'))
resv = calculate_distance_vectorized(lutv, rgb)
td.append((time.time(), None))
print("sanity checks ... ", end='')
assert np.allclose(res, resotf) and np.allclose(res, resv) \
and np.allclose(res, resbla) and np.allclose(lutv, lutb)
random_check_lut(lutv)
print('all ok\n')
t, d = zip(*td)
for ti, di in zip(np.diff(t), d):
print(f'{di:32s} {ti:10.3f} seconds')
Sample run:
sanity checks ... all ok
create lut vectorized 1.116 seconds
create lut using blas 0.917 seconds
lookup using np.take 0.398 seconds
process on the fly (no lookup) 0.127 seconds
process on the fly (blas) 0.069 seconds
lookup using fancy indexing 0.064 seconds
We can see that the best lookup beats the best on-the-fly computation by a whisker. That said the example may overestimate lookup cost, because random pixels are presumably less cache friendly than natural images.
If rgb2something can't be vectorized, and you want to process one typical image, then you can get a decent speedup using np.unique
.
If rgb2something is expensive and multiple images have to be processed, then unique
can be combined with caching, which is conveniently done using functools.lru_cache
---only (minor) stumbling block: arguments must be hashable. As it turns out the modification in code that this forces (casting rgb-arrays to 3-byte strings) happens to benefit performance.
Using a full look up table is only worth it if you have a huge number of pixels covering most hues. In that case the fastest way is using numpy fancy indexing to do the actual lookup.
import numpy as np
import time
import functools
def rgb2something(rgb):
# waste some time:
np.exp(0.1*rgb)
return rgb.mean()
@functools.lru_cache(None)
def rgb2something_lru(rgb):
rgb = np.frombuffer(rgb, np.uint8)
# waste some time:
np.exp(0.1*rgb)
return rgb.mean()
def apply_to_img(img):
shp = img.shape
return np.reshape([rgb2something(x) for x in img.reshape(-1, shp[-1])], shp[:2])
def apply_to_img_lru(img):
shp = img.shape
return np.reshape([rgb2something_lru(x) for x in img.ravel().view('S3')], shp[:2])
def apply_to_img_smart(img, print_stats=True):
shp = img.shape
unq, bck = np.unique(img.reshape(-1, shp[-1]), return_inverse=True, axis=0)
if print_stats:
print('total no pixels', shp[0]*shp[1], '\nno unique pixels', len(unq))
return np.array([rgb2something(x) for x in unq])[bck].reshape(shp[:2])
def apply_to_img_smarter(img, print_stats=True):
shp = img.shape
unq, bck = np.unique(img.ravel().view('S3'), return_inverse=True)
if print_stats:
print('total no pixels', shp[0]*shp[1], '\nno unique pixels', len(unq))
return np.array([rgb2something_lru(x) for x in unq])[bck].reshape(shp[:2])
def make_full_lut():
x = np.empty((3,), np.uint8)
return np.reshape([rgb2something(x) for x[0] in range(256)
for x[1] in range(256) for x[2] in range(256)],
(256, 256, 256))
def make_full_lut_cheat(): # for quicker testing lookup
i, j, k = np.ogrid[:256, :256, :256]
return (i + j + k) / 3
def apply_to_img_full_lut(img, lut):
return lut[(*np.moveaxis(img, 2, 0),)]
from scipy.misc import face
t0 = time.perf_counter()
bw = apply_to_img(face())
t1 = time.perf_counter()
print('naive ', t1-t0, 'seconds')
t0 = time.perf_counter()
bw = apply_to_img_lru(face())
t1 = time.perf_counter()
print('lru first time ', t1-t0, 'seconds')
t0 = time.perf_counter()
bw = apply_to_img_lru(face())
t1 = time.perf_counter()
print('lru second time ', t1-t0, 'seconds')
t0 = time.perf_counter()
bw = apply_to_img_smart(face(), False)
t1 = time.perf_counter()
print('using unique: ', t1-t0, 'seconds')
rgb2something_lru.cache_clear()
t0 = time.perf_counter()
bw = apply_to_img_smarter(face(), False)
t1 = time.perf_counter()
print('unique and lru first: ', t1-t0, 'seconds')
t0 = time.perf_counter()
bw = apply_to_img_smarter(face(), False)
t1 = time.perf_counter()
print('unique and lru second:', t1-t0, 'seconds')
t0 = time.perf_counter()
lut = make_full_lut_cheat()
t1 = time.perf_counter()
print('creating full lut: ', t1-t0, 'seconds')
t0 = time.perf_counter()
bw = apply_to_img_full_lut(face(), lut)
t1 = time.perf_counter()
print('using full lut: ', t1-t0, 'seconds')
print()
apply_to_img_smart(face())
import Image
Image.fromarray(bw.astype(np.uint8)).save('bw.png')
Sample run:
naive 6.8886632949870545 seconds
lru first time 1.7458112589956727 seconds
lru second time 0.4085628940083552 seconds
using unique: 2.0951434450107627 seconds
unique and lru first: 2.0168916099937633 seconds
unique and lru second: 0.3118703299842309 seconds
creating full lut: 151.17599205300212 seconds
using full lut: 0.12164952099556103 seconds
total no pixels 786432
no unique pixels 134105
Upvotes: 5
Reputation: 3338
Firstly, please add what Consts
is in your rgb2something
function, as that will help us understand what exactly the function does.
The best way to speed this up would be to vectorize the operation.
You don't need to construct a lookup table for this operation. If you have a function that is applied on each (r, g, b)
vector, you can simply apply it for every vector in the image using np.apply_along_axis
. In the following example, I'm assuming a simple definition for rgb2something
as a placeholder - this function can of course be replaced by your definition.
def rgb2something(vector):
return sum(vector)
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
transform = np.apply_along_axis(rgb2something, -1, image)
This takes the image
array, and applies the function rgb2something
to each 1-D slice along axis -1
(which is the last channel axis).
While caching is not necessary, there may be specific use cases when it will benefit you greatly. Perhaps you want to perform this pixel-wise operation of rgb2something
across thousands of images, and you suspect that many pixel values will be repeated across images. In such cases, constructing a lookup table could significantly improve performance. I'd suggest lazily filling up the table (I suggest this presuming that your dataset spans images which are somewhat similar - having similar objects, textures and so on, which would mean that in total they span only a relatively small subset of the whole 2^24 search space). If you feel that they span a relatively large subset, you could construct the entire lookup table beforehand (see next section).
lut = [-1] * (256 ** 3)
def actual_rgb2something(vector):
return sum(vector)
def rgb2something(vector):
value = lut[vector[0] + vector[1] * 256 + vector[2] * 65536]
if value == -1:
value = actual_rgb2something(vector)
lut[vector[0] + vector[1] * 256 + vector[2] * 65536] = value
return value
You can then transform each image same as before:
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
transform = np.apply_along_axis(rgb2something, -1, image)
Perhaps your images are diverse enough to encompass a large set of the entire search range, and the cost of construction of the entire cache can be amortized by the reduced lookup cost.
from itertools import product
lut = [-1] * (256 ** 3)
def actual_rgb2something(vector):
return sum(vector)
def fill(vector):
value = actual_rgb2something(vector)
lut[vector[0] + vector[1] * 256 + vector[2] * 65536] = value
# Fill the table
total = list(product(range(256), repeat=3))
np.apply_along_axis(fill, arr=total, axis=1)
Now instead of computing the values again, you can simply look them up from the table:
def rgb2something(vector):
return lut[vector[0] + vector[1] * 256 + vector[2] * 65536]
Transforming images is of course the same as before:
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
transform = np.apply_along_axis(rgb2something, -1, image)
Upvotes: 2