David

Reputation: 83

Fastest way to make Python objects out of numpy array rows

I need to make a list of objects out of a numpy array (or a pandas dataframe). Each row holds all the attribute values for the object (see example).

import numpy as np

class Dog:
    def __init__(self, weight, height, width, girth):
        self.weight = weight
        self.height = height
        self.width = width
        self.girth = girth


dogs = np.array([[5, 100, 50, 80], [4, 80, 30, 70], [7, 120, 60, 90], [2, 50, 30, 50]])

# list comprehension with indexes
dog_list = [Dog(dogs[i][0], dogs[i][1], dogs[i][2], dogs[i][3]) for i in range(len(dogs))]

My real data is of course much bigger (up to a million rows with 5 columns), so iterating line by line and looking up the correct index takes ages. Is there a way to vectorize this or generally make it more efficient/faster? I tried finding ways myself, but I couldn't find anything translatable, at least at my level of expertise.

It's extremely important that the order of rows is preserved though, so if that doesn't work out, I suppose I'll have to live with the slow operation.

Cheers!

EDIT - regarding question about np.vectorize:

This is part of my actual code along with some actual data:

import numpy as np

class Particle:
    TrackID = 0
    def __init__(self, uniq_ident, intensity, sigma, chi2, past_nn_ident, past_distance, aligned_x, aligned_y, NeNA):
        self.uniq_ident = uniq_ident
        self.intensity = intensity
        self.sigma = sigma
        self.chi2 = chi2
        self.past_nn_ident = past_nn_ident
        self.past_distance = past_distance
        self.aligned_y = aligned_y
        self.aligned_x = aligned_x
        self.NeNA = NeNA
        self.new_track_length = 1
        self.quality_pass = True  
        self.re_seeder(self.NeNA)


    def re_seeder(self, NeNA):

        if np.isnan(self.past_nn_ident):
            self.newseed = True
            self.new_track_id = Particle.TrackID
            print(self.new_track_id)
            Particle.TrackID += 1

        else:
            self.newseed = False
            self.new_track_id = None

data = np.array([[0.00000000e+00, 2.98863746e+03, 2.11794100e+02, 1.02241467e+04, np.NaN,np.NaN, 9.00081968e+02, 2.52456745e+04, 1.50000000e+01],
       [1.00000000e+00, 2.80583577e+03, 4.66145720e+02, 6.05642671e+03, np.NaN, np.NaN, 8.27249728e+02, 2.26365501e+04, 1.50000000e+01],
       [2.00000000e+00, 5.28702810e+02, 3.30889610e+02, 5.10632793e+03, np.NaN, np.NaN, 6.03337243e+03, 6.52702811e+04, 1.50000000e+01],
       [3.00000000e+00, 3.56128350e+02, 1.38663730e+02, 3.37923885e+03, np.NaN, np.NaN, 6.43263261e+03, 6.14788766e+04, 1.50000000e+01],
       [4.00000000e+00, 9.10148200e+01, 8.30057400e+01, 4.31205993e+03, np.NaN, np.NaN, 7.63955009e+03, 6.08925862e+04, 1.50000000e+01]])

Particle.TrackID = 0
particles = np.vectorize(Particle)(*data.transpose())

l = [p.new_track_id for p in particles]

The curious thing about this is that the print statement inside the re_seeder function ("print(self.new_track_id)") prints 0, 1, 2, 3, 4, 5.

If I then take the particle objects and make a list of their new_track_id attributes ("l = [p.new_track_id for p in particles]"), the values are 1, 2, 3, 4, 5.

So somewhere, somehow the first object is either lost, overwritten, or something else I don't understand.

Upvotes: 0

Views: 744

Answers (3)

hpaulj

Reputation: 231530

With a simple class:

class Foo():
    _id = 0
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        self.id = self._id
        Foo._id += 1
    def __repr__(self):
        return '<Foo %s>'%self.id


In [23]: arr = np.arange(12).reshape(4,3)

A straightforward list comprehension:

In [24]: [Foo(*xyz) for xyz in arr]
Out[24]: [<Foo 0>, <Foo 1>, <Foo 2>, <Foo 3>]

Default use of vectorize:

In [26]: np.vectorize(Foo)(*arr.T)
Out[26]: array([<Foo 5>, <Foo 6>, <Foo 7>, <Foo 8>], dtype=object)

Note that Foo 4 was skipped. vectorize performs a trial calculation on the first element to determine the return dtype (here object), and that extra call consumed id 4. (This has caused problems for other users.) We can get around it by specifying otypes. There's also a cache parameter that might help, but I haven't played with that.

In [27]: np.vectorize(Foo,otypes=[object])(*arr.T)
Out[27]: array([<Foo 9>, <Foo 10>, <Foo 11>, <Foo 12>], dtype=object)
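For completeness, trying the cache parameter mentioned above would look like the sketch below; whether it actually avoids the extra trial call is not verified here:

np.vectorize(Foo, cache=True)(*arr.T)  # cache=True caches the result of the dtype-determining trial call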

Internally vectorize uses frompyfunc, which in this case works just as well, and in my experience is faster:

In [28]: np.frompyfunc(Foo, 3,1)(*arr.T)
Out[28]: array([<Foo 13>, <Foo 14>, <Foo 15>, <Foo 16>], dtype=object)

Normally vectorize/frompyfunc pass 'scalar' values to the function, iterating over all elements of a 2d array. But the use of *arr.T is a clever way of passing rows - effectively a 1d array of tuples.

In [31]: list(zip(*arr.T)) 
Out[31]: [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]

Some comparative times:

In [32]: Foo._id=0
In [33]: timeit [Foo(*xyz) for xyz in arr]
14.2 µs ± 17.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [34]: Foo._id=0
In [35]: timeit np.vectorize(Foo,otypes=[object])(*arr.T)
44.9 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: Foo._id=0
In [37]: timeit np.frompyfunc(Foo, 3,1)(*arr.T)
15.6 µs ± 18.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This is consistent with my past timings. vectorize is slow. frompyfunc is competitive with a list comprehension, sometimes even 2x faster. Wrapping the list comprehension in an array will slow it down, e.g. np.array([Foo(*xyz)...]).

And your original list comprehension:

In [40]: timeit [Foo(arr[i][0],arr[i][1],arr[i][2]) for i in range(len(arr))]
10.1 µs ± 80 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That's even faster! So if your goal is a list rather than an array, I don't see the point in using numpy tools.

Of course these timings on a small example need to be viewed with caution.
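Applied to the Dog class from the question, the equivalent comprehension is a one-liner like this (row order is preserved since the list is built row by row):

dog_list = [Dog(*row) for row in dogs]  # each row unpacks into (weight, height, width, girth)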

Upvotes: 0

randomwalker

Reputation: 1703

Multiprocessing might be worth a look.

from multiprocessing import Pool

A function that builds one object from a row:

def make_dog(row):
    return Dog(*row)

Let multiple workers build the objects in parallel; Pool.map collects the results in the original row order, so the ordering requirement is met:

number_of_workers = 4
with Pool(processes=number_of_workers) as pool:
    dog_list = pool.map(make_dog, dogs)

Note that the worker function must be defined at module level (not a lambda) so it can be pickled, and the workers return the objects rather than appending to a shared list, because each worker process only sees its own copy of dog_list. On platforms that use the spawn start method, the Pool creation should also sit under an if __name__ == '__main__': guard.

Upvotes: 0

lxop

Reputation: 8605

You won't get great efficiency/speed gains as long as you are insisting on building Python objects. With that many items, you will be much better served by keeping the data in the numpy array. If you want nicer attribute access, you could cast the array as a record array (recarray), which would allow you to name the columns (as weight, height, etc) while still having the data in the numpy array.

dog_t = np.dtype([
    ('weight', int),
    ('height', int),
    ('width', int),
    ('girth', int)
])

dogs = np.array([
    (5, 100, 50, 80),
    (4, 80, 30, 70),
    (7, 120, 60, 90),
    (2, 50, 30, 50),
], dtype=dog_t)

dogs_recarray = dogs.view(np.recarray)

print(dogs_recarray.weight)
print(dogs_recarray[2].height)

You can also mix and match data types if you need to (if some columns are integer and others are float, for example). Be aware when playing with this code that the items in the dogs array need to be specified as tuples (using ()) rather than as lists for the datatype to be applied properly.
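For instance, a mixed-type version might look like the following sketch (the float weight column is just illustrative):

dog_t = np.dtype([
    ('weight', float),   # e.g. weight in kg as a float
    ('height', int),
    ('width', int),
    ('girth', int)
])

dogs = np.array([
    (5.2, 100, 50, 80),
    (4.1, 80, 30, 70),
], dtype=dog_t).view(np.recarray)

print(dogs.weight)  # -> [5.2 4.1]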

Upvotes: 2
