CodeSurgeon

Reputation: 2465

Writing a fast 3d matrix and vector library in cython

As part of rewriting my game engine in Cython, I am trying to improve the performance of my Python+numpy classes for matrix and vector math, as this was one of the major bottlenecks I ran into previously. This group of modules defines classes for types such as Vector2/3/4, Matrix2/3/4, and Quaternion.

Taking a page from the glMatrix JavaScript library, I thought one thing I could do this time around is switch from a class-based design to a module of plain math functions, to reduce overhead. That way I would not have to construct a custom object every time I, say, add two vectors together.
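To make the contrast concrete, here is a minimal sketch of the two API styles (the names are illustrative, not my real code):

```python
# Class-based style: every operation allocates and returns a new object.
class Vec2(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # a fresh Vec2 is constructed on every single add
        return Vec2(self.x + other.x, self.y + other.y)

# glMatrix-style: free functions that write into a caller-supplied "out"
# buffer, so no allocation happens inside the hot path.
def vec2_add(out, a, b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out
```

The second style lets the caller reuse one `out` buffer across many operations, which is exactly what the benchmarks below try to measure.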

To test this out, I wrote a benchmark for creating two Vec2 objects a and b and summing them component-wise to get a Vec2 object out. The code is split into a main.py that does the timing, a vec2.pyx for the Cython code, and a pyvec2.py for the Python code. Here is the code for each of those components:

main.py

import time
import array
import math3d.vec2 as vec2
import math3d.pyvec2 as pyvec2

def test(n, func, param_list):
    start = time.time()
    for i in range(n):
        func(*param_list)
    end = time.time()
    print func, end-start

test(1000000, pyvec2.pyadd, [[1, 2], [3, 4]])
test(1000000, pyvec2.pyadd2, [[0, 0], [1, 2], [3, 4]])
test(1000000, vec2.add, [[1, 2], [3, 4]])
test(1000000, vec2.add2, [array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add3, [array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add4, [array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add5, [[0, 0], [1, 2], [3, 4]])
test(1000000, vec2.add6, [array.array("f", [0, 0]), array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add7, [array.array("f", [0, 0]), array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add8, [array.array("f", [0, 0]), array.array("f", [1, 2]), array.array("f", [3, 4])])
test(1000000, vec2.add9, [[0, 0], [1, 2], [3, 4]])

vec2.pyx

from libc.stdlib cimport malloc, free
from cpython cimport array
import array

def add(list a, list b):
    cdef float[2] out = [0, 0]
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add2(float[:] a, float[:] b):
    cdef float[2] out = [0, 0]
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add3(array.array a, array.array b):
    cdef float[2] out = [0, 0]
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add4(array.array a, array.array b):
    cdef array.array out = array.array("f", [0, 0])
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add5(list out, list a, list b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add6(float[:] out, float[:] a, float[:] b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add7(array.array out, array.array a, array.array b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add8(array.array out, array.array a, array.array b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

def add9(out, a, b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

pyvec2.py

def pyadd(a, b):
    out = [a[0] + b[0], a[1] + b[1]]
    return out

def pyadd2(out, a, b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

Here are the results after running main.py:

<function pyadd at 0x0000000003354828> 0.380000114441
<function pyadd2 at 0x0000000003354908> 0.31299996376
<built-in function add> 0.261000156403
<built-in function add2> 0.680999994278
<built-in function add3> 0.268000125885
<built-in function add4> 0.601000070572
<built-in function add5> 0.144999980927
<built-in function add6> 1.06299996376
<built-in function add7> 0.241000175476
<built-in function add8> 0.237999916077
<built-in function add9> 0.141000032425

From this, it looks like plain Python lists are faster than typed arrays! Furthermore, blindly compiling my untyped Python function with Cython yielded the best results! It appears that a significant portion of the execution time is spent converting to and from Python types.
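One cheap way to sanity-check that call overhead, rather than arithmetic, dominates here is to time a function that does nothing at all. This is a sketch using `timeit` instead of my original timing loop:

```python
import timeit

def noop(a, b):
    # no arithmetic at all: everything measured here is call overhead
    pass

def add_lists(a, b):
    # the actual work: two float additions and one list construction
    return [a[0] + b[0], a[1] + b[1]]

a = [1.0, 2.0]
b = [3.0, 4.0]
n = 100000
t_noop = timeit.timeit(lambda: noop(a, b), number=n)
t_add = timeit.timeit(lambda: add_lists(a, b), number=n)
# if t_noop is a large fraction of t_add, the additions themselves are nearly free
print(t_noop, t_add)
```

The same check applies on the Cython side: the typed variants pay an extra conversion cost per argument on top of this baseline.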

I am therefore curious whether there are even faster ways to perform the math on the Cython side while minimizing the Python overhead of passing in arguments. I do not necessarily need to expose lists or array.array objects for direct access to the contents of my vectors or matrices. It would be ideal to just pass a pointer between Python and my Cython math module, but this does not seem possible, since pointers are not Python objects. Any suggestions would be appreciated.


UPDATE:

Below is the code for my Vec2 class. It is composed of two files: the first is a base _Vec class, and the second inherits from it as a specific Vec2 class.

vec.py

import math
import numpy as np
import random

class _Vec(object):

    def __init__(self, *args):
        try:
            data, = args
            data = np.array(data, dtype=np.float32)
            cls_name = self.__class__.__name__
            vec2_check = cls_name == "Vec2" and len(data) != 2
            vec3_check = cls_name == "Vec3" and len(data) != 3
            vec4_check = cls_name == "Vec4" and len(data) != 4
            if any([vec2_check, vec3_check, vec4_check]):
                raise TypeError("{0} is not a valid {1}".format(data, cls_name))
        except ValueError:
            data = np.array(args, dtype=np.float32)
        self._data = data

    def __add__(self, other):
        if isinstance(other, self.__class__):
            return self.__class__(self._data + other._data)
        return self.__class__(self._data + other)

    def __radd__(self, other):
        return self.__class__(other + self._data)

    def __sub__(self, other):
        if isinstance(other, self.__class__):
            return self.__class__(self._data - other._data)
        return self.__class__(self._data - other)

    def __rsub__(self, other):
        return self.__class__(other - self._data)

    def __mul__(self, other):
        if isinstance(other, self.__class__):
            return self.__class__(self._data * other._data)
        return self.__class__(self._data * other)

    def __rmul__(self, other):
        return self.__class__(other * self._data)

    def __div__(self, other):
        if isinstance(other, self.__class__):
            return self.__class__(self._data / other._data)
        return self.__class__(self._data / other)

    def __rdiv__(self, other):
        return self.__class__(other / self._data)

    def __neg__(self):
        return self.__class__(-self._data)

    def __pos__(self):
        return self.__class__(+self._data)

    def __eq__(self, other):
        return np.array_equal(self._data, other._data)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __lt__(self, other):
        return self.square_length() < other.square_length()

    def __le__(self, other):
        return self.square_length() <= other.square_length()

    def __gt__(self, other):
        return self.square_length() > other.square_length()

    def __ge__(self, other):
        return self.square_length() >= other.square_length()

    def __repr__(self):
        return "{0}(data={1})".format(self.__class__.__name__, self.get_data())

    def __str__(self):
        return np.array_str(self._data)

    def ceil(self):
        return self.__class__(np.ceil(self._data))

    def floor(self):
        return self.__class__(np.floor(self._data))

    def get_data(self):
        return self._data.flatten().tolist()

    def inverse(self):
        return self.__class__(1.0/self._data)

    def length(self):
        return float(np.linalg.norm(self._data))

    def negate(self):
        return self.__class__(-self._data)

    def normalize(self):
        length = self.length()
        if length == 0.0:
            return self.__class__(np.zeros(self._data.shape))
        return self.__class__(self._data/length)

    def round(self, decimal=0):
        return self.__class__(np.round(self._data, decimal))

    def square_length(self):
        return float(np.sum(np.square(self._data)))

    @classmethod
    def distance(cls, a, b):
        c = b - a
        return c.length()

    @classmethod
    def dot(cls, a, b):
        return float(np.dot(a._data, b._data))

    @classmethod
    def equals(cls, a, b, tolerance=0.0):
        diffs = np.fabs((a - b)._data)
        pairs = zip(list(np.fabs(a._data)), list(np.fabs(b._data)))
        tolerance_calcs = [tolerance * max(1, a_val, b_val) for (a_val, b_val) in pairs]
        tests = [d <= t for (d, t) in zip(diffs, tolerance_calcs)]
        return all(tests)

    @classmethod
    def lerp(cls, a, b, t):
        return a*(1-t) + b*t

    @classmethod
    def max_components(cls, a, b):
        return cls(np.maximum(a._data, b._data))

    @classmethod
    def min_components(cls, a, b):
        return cls(np.minimum(a._data, b._data))

    @classmethod
    def random(cls, n):
        return cls(np.random.rand(n))

    @classmethod
    def square_distance(cls, a, b):
        c = b - a
        return c.square_length()

vec2.py

from vec import _Vec
from vec3 import Vec3
from vec4 import Vec4
import math
import numpy as np
import random

class Vec2(_Vec):

    @property
    def x(self):
        return float(self._data[0])

    @x.setter
    def x(self, value):
        self._data[0] = float(value)

    @property
    def y(self):
        return float(self._data[1])

    @y.setter
    def y(self, value):
        self._data[1] = float(value)

    def __repr__(self):
        return "Vec2(x={0}, y={1})".format(self.x, self.y)

    def transform_mat2(self, a):
        prod = np.dot(a._data.T, self._data.T).T
        return Vec2(prod)

    def transform_mat3(self, a):
        v3 = Vec3(self.get_data() + [1])
        prod = np.dot(a._data.T, v3._data.T).T
        return Vec2(prod[0:2])

    def transform_mat4(self, a):
        v4 = Vec4(self.get_data() + [0, 1])
        prod = np.dot(a._data.T, v4._data.T).T
        return Vec2(prod[0:2])

    @classmethod
    def random(cls):
        return super(Vec2, cls).random(2)

UPDATE 2:

As kazemakase pointed out, some of the values in main.py were integers. After making everything floating point by appending .0, I got the following times instead:

<function pyadd at 0x0000000002FF4828> 0.384000062943
<function pyadd2 at 0x0000000002FF4908> 0.332000017166
<built-in function add> 0.227999925613
<built-in function add2> 0.640000104904
<built-in function add3> 0.258999824524
<built-in function add4> 0.556999921799
<built-in function add5> 0.145999908447
<built-in function add6> 0.983999967575
<built-in function add7> 0.217000007629
<built-in function add8> 0.236000061035
<built-in function add9> 0.131000041962

These are similar to the original times: cases 5 and 9 remain the fastest.


UPDATE 3:

As BrenBarn pointed out, it would probably be useful to explain how I used my original python+numpy class and why I am struggling with its performance. Originally, my entire 3d game library project was in pure python, using PyOpenGL for rendering the graphics. In order to position meshes/models in my 3d world, every frame, I need to calculate a Mat4 transformation matrix that defines that object's position, rotation, and scale in the world that is then uploaded to the GPU. When I was positioning many objects (> 1000), the frame rate in my app would slow down. While temporarily disabling 3d did improve performance somewhat, my app was still lagging below 60 fps. That is when I realized that simply calculating the Mat4 matrices in python was to blame. When I removed these calculations and just drew all of my 3d objects at the origin, performance went back up.

Wondering then if ALL of my math3d library was slow, I thought I would start benchmarking all of these classes. I did the following benchmark for my Vec2 class's add function as it was the simplest and computationally cheapest function to test:

import time
from vec2 import Vec2

def add(a, b):
    out = a + b
    return out

def test(n, func, param_list):
    start = time.time()
    for i in range(n):
        func(*param_list)
    end = time.time()
    print func, end-start

test(1000000, add, [Vec2([1.0, 2.0]), Vec2([3.0, 4.0])])
#<function add at 0x00000000041B1CF8> 2.81699991226 (using the real def add(a, b) function)
#<function add at 0x000000000362CCF8> 0.168999910355 (just passing in values to def add(a, b): pass)

Those comments show the times I got just for adding two Vec2 objects together. From that, I concluded that the math was slow, that there was significant Python function call and class overhead, and that this was a performance bottleneck to consider in my 3d library. I hope this provides some rationale for the question.
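For what it's worth, a rough way to separate allocation cost from arithmetic cost is to time construction of a trivial two-attribute class on its own. This is an illustrative sketch, not my actual Vec2:

```python
import timeit

class Tiny(object):
    # stand-in for a Vec2-like object: stores two floats, does no math
    def __init__(self, x, y):
        self.x = x
        self.y = y

a, b = 1.0, 3.0
n = 100000
# cost of allocating one object per operation (the class-based design)
t_alloc = timeit.timeit(lambda: Tiny(1.0, 2.0), number=n)
# cost of a bare float addition for comparison
t_math = timeit.timeit(lambda: a + b, number=n)
print(t_alloc, t_math)
```

On my understanding, the allocation side is typically several times the cost of the addition itself, which is consistent with the gap between the real add and the pass version above.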


Update 4:

What Paul Cornelius said about incurring the penalty of translation from Python to Cython only once got me thinking: why not just fetch a "pointer" to the underlying float * data whenever I create a "Vec2 object"? Then I can pass those pointers into future math operations, Cython can dereference them to get the actual data, and the math can be performed. The result is the following code:

main.py

import time
import array
import math3d.vec2 as vec2

def make_list(a, b):
    out = [a, b]

def test(n, func, param_list):
    start = time.time()
    for i in range(n):
        func(*param_list)
    end = time.time()
    print func, end-start

test(1000000, vec2.create, [1, 2])
test(1000000, make_list, [1, 2])
a = vec2.create(1, 2)
#b = vec2.get_data(a)
b = vec2.create(3, 4)
c = vec2.create(0, 0)
test(1000000, vec2.add, [c, a, b])
test(1000000, vec2.add2, [[0, 0], [1, 2], [3, 4]])
print vec2.get_data(c)

vec2.pyx

from libc.stdint cimport uintptr_t
from libc.stdlib cimport malloc, free

def add(uintptr_t out, uintptr_t a, uintptr_t b):
    cdef float *a_data = <float *>a
    cdef float *b_data = <float *>b
    cdef float *out_data = <float *>out
    out_data[0] = a_data[0] + b_data[0]
    out_data[1] = a_data[1] + b_data[1]

def create(float x, float y):
    # allocate space for two floats (not one), and assign element-wise;
    # slice-assignment into a raw pointer is not valid Cython
    cdef float *a = <float *>malloc(2 * sizeof(float))
    a[0] = x
    a[1] = y
    cdef uintptr_t u_ptr = <uintptr_t> a
    return u_ptr

def destroy(uintptr_t u_ptr):
    # without this, every create() leaks its allocation
    free(<float *>u_ptr)

def get_data(uintptr_t u_ptr):
    cdef float *b = <float *>u_ptr
    return b[0], b[1]

def add2(out, a, b):
    out[0] = a[0] + b[0]
    out[1] = a[1] + b[1]
    return out

Timing Results

<built-in function create> 0.19000005722
<function make_list at 0x0000000002994828> 0.269999980927
<built-in function add> 0.111000061035
<built-in function add2> 0.141999959946
(4.0, 6.0)

Of course, handling uintptr_t integers in Python is pretty unsafe: they could mistakenly be added together on the Python side with the + operator, and every pointer must eventually be freed by hand. Also, it is unclear whether the slight performance benefit (0.03 seconds for 1 million operations) is really worth it.
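If I do keep the pointer approach, one way to contain the foot-gun is to hide the integer inside an opaque handle on the Python side, so accidental arithmetic fails loudly. A sketch (Vec2Handle is my own hypothetical name, not part of any library):

```python
class Vec2Handle(object):
    """Opaque wrapper around the uintptr_t returned by the Cython module."""
    __slots__ = ("_ptr",)

    def __init__(self, ptr):
        self._ptr = ptr

    @property
    def ptr(self):
        # call sites would unwrap the handle via this property,
        # e.g. vec2.add(c.ptr, a.ptr, b.ptr)
        return self._ptr

    def __add__(self, other):
        # refuse pointer arithmetic: adding two handles is always a bug
        raise TypeError("Vec2Handle does not support +; use vec2.add")
```

The trade-off is that each `.ptr` access is an extra Python attribute lookup per argument, which eats into the already small gains measured above.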

Upvotes: 0

Views: 838

Answers (1)

Paul Cornelius

Reputation: 10946

Cython's strength is arithmetic operations on basic C data types (ints, floats, doubles) and arrays of those. The only arithmetic operation in your code is two simple additions. The rest is data type conversions and array element access. Those operations will certainly dominate, as your timing results indicate. Every time you execute a function call from Python to Cython, there is type checking and conversion overhead. You need to do enough number crunching on the Cython side of this barrier to make it worthwhile.

This is not a good use case for Cython.

Is adding a lot of two-element lists really the bottleneck in your application? If it is, you should consider storing all the data in Cython variables, incurring the penalty of translation only once. Move data to the Python side only when you have to. If it isn't, you need to create a more realistic test.
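To illustrate the point about crossing the boundary once: instead of one call per vector add, many adds can be batched into a single call so the per-call overhead is amortized. A pure-Python sketch of the API shape (in real code the inner loop would be a typed Cython loop over contiguous float buffers):

```python
from array import array

def add_batch(out, a, b):
    # one call processes n 2-component vectors stored interleaved as
    # [x0, y0, x1, y1, ...]; the call overhead is paid once for all of them
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

a = array("f", [1.0, 2.0, 3.0, 4.0])   # two vectors: (1, 2) and (3, 4)
b = array("f", [5.0, 6.0, 7.0, 8.0])   # two vectors: (5, 6) and (7, 8)
out = array("f", [0.0] * 4)
add_batch(out, a, b)
```

This is essentially how the questioner's original numpy design would win too, if the transforms for all 1000+ objects were computed in one vectorized call per frame rather than object by object.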

My experience is that Cython can usually match or even outperform numpy, although it may take some effort to optimize. (numpy/scipy of course has the "slight" advantage of offering more functionality than I could create in a hundred lifetimes.) But the same process of converting Python to C data types must occur there as well.

Upvotes: 1
