Mauricio
Mauricio

Reputation: 3079

Find largest row in a matrix with numpy (row with highest length)

I have a massive array with rows and columns. Some rows are larger than others. I need to get the max length row, that is, the row that has the highest length. I wrote a simple function for this, but I wanted it to be as fas as possible, like numpy fast. Currently, it looks like this:

Example array:

values = [
    [1,2,3],
    [4,5,6,7,8,9],
    [10,11,12,13]
]

def values_max_width(values):
    max_width = 1
    for row in values:
        if len(row) > max_width:
            max_width = len(row)
    return max_width

Is there any way to accomplish this with numpy?

Upvotes: 0

Views: 1494

Answers (2)

hpaulj
hpaulj

Reputation: 231335

In [261]: values = [ 
     ...:     [1,2,3], 
     ...:     [4,5,6,7,8,9], 
     ...:     [10,11,12,13] 
     ...: ] 
     ...:                                                                       
In [262]:                                                                       
In [262]: values                                                                
Out[262]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]
In [263]: def values_max_width(values): 
     ...:     max_width = 1 
     ...:     for row in values: 
     ...:         if len(row) > max_width: 
     ...:             max_width = len(row) 
     ...:     return max_width 
     ...:                                                                       
In [264]: values_max_width(values)                                              
Out[264]: 6
In [265]: [len(v) for v in values]                                              
Out[265]: [3, 6, 4]
In [266]: max([len(v) for v in values])                                         
Out[266]: 6
In [267]: np.max([len(v) for v in values])                                      
Out[267]: 6

Your loop and the list comprehension are similar in speed, np.max is much slower - it has to first turn the list into an array.

In [268]: timeit max([len(v) for v in values])                                  
656 ns ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [269]: timeit np.max([len(v) for v in values])                               
13.9 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [271]: timeit values_max_width(values)                                       
555 ns ± 13 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

If you are starting with a list, it's a good idea to thoroughly test the list implementation. numpy is fast when it is doing compiled array stuff, but creating an array from a list is time consuming.

Making an array directly from values isn't much help. The result in a object dtype array:

In [272]: arr = np.array(values)                                                
In [273]: arr                                                                   
Out[273]: 
array([list([1, 2, 3]), list([4, 5, 6, 7, 8, 9]), list([10, 11, 12, 13])],
      dtype=object)

Math on such an array is hit-or-miss, and always slower than math on pure numeric arrays. We can iterate on such an array, but that iteration is slower than on a list.

In [275]: values_max_width(arr)                                                 
Out[275]: 6
In [276]: timeit values_max_width(arr)                                          
1.3 µs ± 8.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Upvotes: 1

Noam Peled
Noam Peled

Reputation: 4622

Not sure how you can make it faster. I've tried using np.max over the length of each item, but that will take even longer:

import numpy as np
import time

values = []
for k in range(100000):
    values.append(list(np.random.randint(100, size=np.random.randint(1000))))


def timeit(func):
    def wrapper(*args, **kwargs):
        now = time.time()
        retval = func(*args, **kwargs)
        print('{} took {:.5f}s'.format(func.__name__, time.time() - now))
        return retval
    return wrapper

@timeit
def values_max_width(values):
    max_width = 1
    for row in values:
        if len(row) > max_width:
            max_width = len(row)
    return max_width


@timeit
def value_max_width_len(values):
    return np.max([len(l) for l in values])


values_max_width(values)
value_max_width_len(values)

values_max_width took 0.00598s

value_max_width_len took 0.00994s

* Edit *

As @Mstaino suggested, using map does make this code faster:

@timeit
def value_max_width_len(values):
    return max(map(len, values))

values_max_width took 0.00598s

value_max_width_len took 0.00499s

Upvotes: 1

Related Questions