Reputation: 4255

Convert 3d numpy array into a 2d numpy array (where contents are tuples)

I have the following 3d numpy array np.random.rand(6602, 3176, 2). I would like to convert it to a 2d array (numpy or pandas.DataFrame), where each value inside is a tuple, such that the shape is (6602, 3176).

This question helped me see how to decrease the dimensions, but I still struggle with the tuple bit.

Upvotes: 9

Answers (4)

Paul Panzer

Reputation: 53029

Here is a one-liner that takes a few seconds on the full (6602, 3176, 2) problem:

import numpy as np

a = np.random.rand(6602, 3176, 2)

b = a.view([(f'f{i}',a.dtype) for i in range(a.shape[-1])])[...,0].astype('O')

The trick here is to view-cast to a compound dtype which spans exactly one row of the last axis. When such a compound dtype is then cast to object, each compound element is converted to a tuple.
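To make the intermediate steps visible, here is a minimal sketch of the same view cast on a small array. The variable names (`compound`, `structured`, `squeezed`) are only illustrative, and the small (2, 3, 2) input is an assumption standing in for the full-size array.

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(2, 3, 2)

# One field per entry of the last axis; every field reuses a's dtype.
compound = [(f"f{i}", a.dtype) for i in range(a.shape[-1])]

structured = a.view(compound)   # shape (2, 3, 1): the last axis collapses to length 1
squeezed = structured[..., 0]   # shape (2, 3): drop the now-trivial last axis
b = squeezed.astype("O")        # each compound element becomes a Python tuple

print(structured.shape, squeezed.shape, b.shape)
print(type(b[0, 0]), b[0, 0])
```

The `[..., 0]` step is needed because the view keeps the last axis with length 1 instead of removing it.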

UPDATE (hat tip @hpaulj): there is a library function that does precisely the view casting we do manually: numpy.lib.recfunctions.unstructured_to_structured

Using this we can write a more readable version of the above:

import numpy.lib.recfunctions as nlr

b = nlr.unstructured_to_structured(a).astype('O')

Upvotes: 10

norok2

Reputation: 26896

If you are happy with lists instead of tuples, this could be achieved with the following trick:

1. convert your array to a list of lists using .tolist()
2. change the size of one of the innermost lists (misalign)
3. convert the list of lists back to a NumPy array
4. revert the modification of point 2.

This is implemented in the following function last_dim_as_list():

import numpy as np


def last_dim_as_list(arr):
    if arr.ndim > 1:
        # : convert to list of lists
        arr_list = arr.tolist()
        # : misalign size of the first innermost list
        temp = arr_list
        for _ in range(arr.ndim - 1):
            temp = temp[0]
        temp.append(None)
        # : convert to NumPy array
        # (`dtype=object` must be explicit: because of the misalignment,
        # recent NumPy versions raise an error on ragged input otherwise)
        result = np.array(arr_list, dtype=object)
        # : revert the misalignment
        temp.pop()
    else:
        result = np.empty(1, dtype=object)
        result[0] = arr.tolist()
    return result

np.random.seed(0)
in_arr = np.random.randint(0, 9, (2, 3, 2))
out_arr = last_dim_as_list(in_arr)


print(in_arr)
# [[[5 0]
#   [3 3]
#   [7 3]]
#  [[5 2]
#   [4 7]
#   [6 8]]]
print(in_arr.shape)
# (2, 3, 2)
print(in_arr.dtype)
# int64

print(out_arr)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr.shape)
# (2, 3)
print(out_arr.dtype)
# object

However, I would NOT recommend taking this route unless you really know what you are doing. Most of the time you are better off keeping everything as a NumPy array of higher dimensionality and making good use of NumPy indexing.

Note that this could also be done with explicit loops, but the proposed approach should be much faster for large enough inputs:

def last_dim_as_list_loop(arr):
    shape = arr.shape
    result = np.empty(arr.shape[:-1], dtype=object).ravel()
    for k in range(arr.shape[-1]):
        for i in range(result.size):
            if k == 0:
                result[i] = []
            result[i].append(arr[..., k].ravel()[i])
    return result.reshape(shape[:-1])


out_arr2 = last_dim_as_list_loop(in_arr)

print(out_arr2)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr2.shape)
# (2, 3)
print(out_arr2.dtype)
# object

But the timings for this last approach are not exactly spectacular:

%timeit last_dim_as_list(in_arr)
# 2.53 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit last_dim_as_list_loop(in_arr)
# 12.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The view-based approach proposed by @PaulPanzer is very elegant and more efficient than the trick proposed in last_dim_as_list() because it loops (internally) through the array only once as compared to twice:

def last_dim_as_tuple(arr):
    dtype = [(str(i), arr.dtype) for i in range(arr.shape[-1])]
    return arr.view(dtype)[..., 0].astype(object)

and therefore the timings on large enough inputs are more favorable:

in_arr = np.random.random((6602, 3176, 2))


%timeit last_dim_as_list(in_arr)
# 4.9 s ± 73.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit last_dim_as_tuple(in_arr)
# 3.07 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
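One caveat worth sketching: viewing an array with a dtype of a different item size requires the last axis to be contiguous in memory, so the view-based function can fail on sliced or flipped inputs. The variant below is an assumption on my part rather than part of either answer (the name `last_dim_as_tuple_safe` is hypothetical); it makes a contiguous copy first when needed.

```python
import numpy as np


def last_dim_as_tuple_safe(arr):
    # Hypothetical helper: np.ascontiguousarray copies only if `arr`
    # is not already C-contiguous, so contiguous inputs pay nothing extra.
    arr = np.ascontiguousarray(arr)
    dtype = [(f"f{i}", arr.dtype) for i in range(arr.shape[-1])]
    return arr.view(dtype)[..., 0].astype(object)


base = np.arange(12).reshape(2, 3, 2)
flipped = base[..., ::-1]   # negative stride along the last axis
out = last_dim_as_tuple_safe(flipped)

print(out.shape)
print(out[0, 0])
```

For inputs that are already contiguous, such as the output of np.random.random, this should behave like last_dim_as_tuple().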

Upvotes: 0

dtrckd

Reputation: 662

A vectorized approach (it's a bit tricky):

mat = np.random.rand(6602, 3176, 2)

f = np.vectorize(lambda x:tuple(*x.items()), otypes=[np.ndarray])
mat2 = np.apply_along_axis(lambda x:dict([tuple(x)]), 2, mat)
mat2 = np.vstack(f(mat2))

mat2.shape
Out: (6602, 3176)

type(mat2[0,0])
Out: tuple

Upvotes: 0

AnsFourtyTwo

Reputation: 2518

If you really want to do this, you have to set the dtype of your array to object. E.g., if you have the mentioned array:

a = np.random.rand(6602, 3176, 2)

You could create a second empty array with shape (6602, 3176) and set dtype to object:

b = np.empty(a[:,:,0].shape, dtype=object)

and fill your array with tuples.
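The filling step is not spelled out above; one possible way to do it is an explicit loop over all index pairs, sketched here on a small stand-in array (the (4, 5, 2) shape is an assumption chosen to keep the example fast). Expect this to be slow on the full-size array, since it runs one Python-level iteration per element.

```python
import numpy as np

a = np.random.rand(4, 5, 2)                # small stand-in for the full array
b = np.empty(a.shape[:-1], dtype=object)   # same as a[:, :, 0].shape

# np.ndindex yields every (n, m) index pair of b.
for idx in np.ndindex(b.shape):
    # a[idx] is the length-2 slice a[n, m, :]; store it as a plain tuple.
    b[idx] = tuple(a[idx].tolist())

print(b.shape, type(b[0, 0]))
```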

But in the end there is no big advantage. I'd just use slicing to get the value pairs from your initial array a: index the first dimension with n and the second with m, and slice the whole third dimension:

a[n,m,:]
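As a small illustration of that on-demand access (the indices and the reduced array shape are arbitrary assumptions), note that the slice is a length-2 array rather than a tuple, and can be converted at the point of use if a real tuple is needed:

```python
import numpy as np

a = np.random.rand(4, 5, 2)       # small stand-in for the full array
n, m = 1, 3

pair = a[n, m, :]                 # 1-D view of length 2, no copy is made
x, y = pair                       # unpacks like a tuple would
as_tuple = tuple(pair.tolist())   # convert only when a real tuple is required

print(pair.shape, as_tuple)
```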

Upvotes: 2
