Tom

Reputation: 233

Physically transposing a large non-square numpy matrix

Is there any quicker way to physically transpose a large 2D numpy matrix than array.transpose().copy()? And are there any routines for doing it with efficient memory use?

Upvotes: 2

Views: 474

Answers (3)

Han-Kwang Nienhuys

Reputation: 3254

I assume that you need to do a row-wise operation that uses the CPU cache more efficiently if rows are contiguous in memory, and you don't have enough memory available to make a copy.

Wikipedia has an article on in-place matrix transposition. It turns out that such a transposition is nontrivial. Here is a follow-the-cycles algorithm as described there:

import numpy as np
from numba import njit

@njit # comment this line for debugging
def transpose_in_place(a):
    """In-place matrix transposition for a rectangular matrix.
    
    https://stackoverflow.com/a/62507342/6228891 
    
    Parameter:
    
    - a: 2D array. Unless it's a square matrix, it will be scrambled
      in the process.
    
    Return:
        
    - transposed array, using the same in-memory data storage as the 
      input array.
      
    This algorithm is typically 10x slower than a.T.copy().
    Only use it if you are short on memory.
    """
    if a.shape == (1, 1):
        return a # special case

    n, m = a.shape
    
    # find max length L of permutation cycle by starting at a[0,1].
    # k is the index in the flat buffer; i, j are the indices in
    # a.
    L = 0
    k = 1
    while True:
        j = k % m
        i = k // m
        k = n*j + i
        L += 1
        if k == 1:
            break
    permut = np.zeros(L, dtype=np.int32)

    # Now do the permutations, one cycle at a time    
    seen = np.full(n*m, False)
    aflat = a.reshape(-1) # flat view
    
    for k0 in range(1, n*m-1):
        if seen[k0]:
            continue
        # construct cycle
        k = k0
        permut[0] = k0
        q = 1 # size of permutation array
        while True:
            seen[k] = True
            # note that this is slightly faster than the formula
            # on Wikipedia, k = n*k % (n*m-1)      
            i = k // m
            j = k - i*m            
            k = n*j + i
            if k == k0:
                break
            permut[q] = k
            q += 1
            
        # apply cyclic permutation
        tmp = aflat[permut[q-1]]
        aflat[permut[1:q]] = aflat[permut[:q-1]]
        aflat[permut[0]] = tmp            
    
    aT = aflat.reshape(m, n)
    return aT

def test_transpose(n, m):
    a = np.arange(n*m).reshape(n, m)
    aT = a.T.copy()
    assert np.all(transpose_in_place(a) == aT)

def roundtrip_inplace(a):
    a = transpose_in_place(a)
    a = transpose_in_place(a)
    
def roundtrip_copy(a):
    a = a.T.copy()
    a = a.T.copy()

if __name__ == '__main__':
    test_transpose(1, 1)
    test_transpose(3, 4)
    test_transpose(5, 5)
    test_transpose(1, 5)
    test_transpose(5, 1)
    test_transpose(19, 29)

Even though I'm using numba.njit here so that the loops in the transpose function are compiled, it's still quite a bit slower than a copy-transpose.

n, m = 1000, 10000
a_big = np.arange(n*m, dtype=np.float64).reshape(n, m)

%timeit -r2 -n10 roundtrip_copy(a_big)
54.5 ms ± 153 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)

%timeit -r2 -n1 roundtrip_inplace(a_big)
614 ms ± 141 ms per loop (mean ± std. dev. of 2 runs, 1 loop each) 
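
Note also that the caller has to keep using the returned array: the result is an (m, n) view of the same data buffer, and the original name still refers to the now-scrambled (n, m) array. A minimal usage sketch:

a = np.arange(12.0).reshape(3, 4)
aT = transpose_in_place(a)  # aT is a (4, 3) view of the same buffer
# np.shares_memory(a, aT) is True: no second buffer was allocated.
# `a` itself is scrambled now, so keep working with aT
# (or simply rebind: a = transpose_in_place(a), as in roundtrip_inplace above).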

Upvotes: 1

hpaulj

Reputation: 231355

It may be worth looking at what transpose does, just so we are clear about what you mean by 'physically transposing'.

Start with a small (4,3) array:

In [51]: arr = np.array([[1,2,3],[10,11,12],[22,23,24],[30,32,34]])             
In [52]: arr                                                                    
Out[52]: 
array([[ 1,  2,  3],
       [10, 11, 12],
       [22, 23, 24],
       [30, 32, 34]])

This is stored as a 1d data buffer, which we can display with ravel:

In [53]: arr.ravel()                                                            
Out[53]: array([ 1,  2,  3, 10, 11, 12, 22, 23, 24, 30, 32, 34])

and strides, which tell it to step from one column to the next by 8 bytes, and from one row to the next by 24 (3*8):

In [54]: arr.strides                                                            
Out[54]: (24, 8)

We can ravel with order 'F' - that reads the array column by column, going down the rows:

In [55]: arr.ravel(order='F')                                                   
Out[55]: array([ 1, 10, 22, 30,  2, 11, 23, 32,  3, 12, 24, 34])

While [53] is a view, [55] is a copy.
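
We can confirm which is which with np.shares_memory (a quick sketch, reusing arr from above):

np.shares_memory(arr, arr.ravel())           # True: the 'C' ravel is a view of arr's buffer
np.shares_memory(arr, arr.ravel(order='F'))  # False: the 'F' ravel had to allocate a copy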

Now the transpose:

In [57]: arrt=arr.T                                                             
In [58]: arrt                                                                   
Out[58]: 
array([[ 1, 10, 22, 30],
       [ 2, 11, 23, 32],
       [ 3, 12, 24, 34]])

This is a view; we can traverse the [53] data buffer, going from row to row in 8 byte steps. Doing calculations with arrt is basically just as fast as with arr. With numpy's strided iteration, order 'F' is just as fast as order 'C'.

In [59]: arrt.strides                                                           
Out[59]: (8, 24)
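
If you want to check that speed claim on your own arrays, here's a quick sketch (exact timings will depend on the machine and the array size):

big = np.arange(3000*4000, dtype=np.float64).reshape(3000, 4000)
bigt = big.T                 # a view with swapped strides; no data is moved
# %timeit big.sum()          # iterates the buffer in memory order
# %timeit bigt.sum()         # same buffer, swapped strides - essentially the same speed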

Raveling the transpose in 'F' order gives back the original buffer order:

In [60]: arrt.ravel(order='F')                                                  
Out[60]: array([ 1,  2,  3, 10, 11, 12, 22, 23, 24, 30, 32, 34])

but doing a 'C' ravel creates a copy, with the same values as [55]:

In [61]: arrt.ravel(order='C')                                                  
Out[61]: array([ 1, 10, 22, 30,  2, 11, 23, 32,  3, 12, 24, 34])

Making a copy of the transpose produces an array that holds the transposed values in 'C' order. This is your 'physical transpose':

In [62]: arrc = arrt.copy()                                                     
In [63]: arrc.strides                                                           
Out[63]: (32, 8)

Raveling a transpose in 'C' order, as in [61], does make a copy, but usually we don't need to make the copy explicitly. I think the only reason to do so is to avoid several redundant copies in later calculations.
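
If you do want the physical transpose, np.ascontiguousarray is an equivalent spelling of the copy (a sketch, reusing arr from above); it only copies when its input is not already C-contiguous:

arrc2 = np.ascontiguousarray(arr.T)   # same values and layout as arrc above
arrc2.flags['C_CONTIGUOUS']           # True: rows of the transpose are now contiguous
np.shares_memory(arr, arrc2)          # False: a new buffer was allocated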

Upvotes: 2

ATOMP

Reputation: 1489

Whatever you do will require O(n·m) time and memory for an n×m matrix. I would assume that .transpose() and .copy() (written in C) will be the most efficient choice for your application.

Edit: this assumes you actually need to copy the matrix.
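
For reference, a minimal sketch of that straightforward approach (it temporarily needs a second buffer the same size as the input):

import numpy as np

a = np.random.rand(1000, 10000)   # ~80 MB of float64
aT = a.T.copy()                   # contiguous transposed copy; the loop runs in numpy's C code
del a                             # release the original buffer once it is no longer needed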

Upvotes: 1
