Reputation: 4255
I have the following 3d numpy array: np.random.rand(6602, 3176, 2).
I would like to convert it to a 2d array (numpy or pandas.DataFrame), where each value inside is a tuple, such that the shape is (6602, 3176).
This question helped me see how to reduce the dimensions, but I still struggle with the tuple part.
Upvotes: 9
Views: 2692
Reputation: 53029
Here is a one-liner which takes a few seconds on the full (6602, 3176, 2) problem:
a = np.random.rand(6602, 3176, 2)
b = a.view([(f'f{i}',a.dtype) for i in range(a.shape[-1])])[...,0].astype('O')
The trick here is to view-cast to a compound dtype which spans exactly one row. When an array of such a compound dtype is then cast to object, each compound element is converted to a tuple.
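To see what the view does step by step, here is a minimal sketch on a small array. The names small, pair_dtype, structured and as_tuples are illustrative and not part of the answer above, and the sketch assumes a C-contiguous input, since view reinterprets the last axis as it is laid out in memory.

```python
import numpy as np

small = np.arange(12, dtype=float).reshape(2, 3, 2)

# One field per entry of the last axis, so the compound dtype spans one row.
pair_dtype = [(f"f{i}", small.dtype) for i in range(small.shape[-1])]

# Viewing collapses the last axis to length 1; [..., 0] then drops it.
structured = small.view(pair_dtype)[..., 0]

# Casting the structured array to object turns each element into a tuple.
as_tuples = structured.astype(object)

print(structured.shape)
print(as_tuples[1, 2])
```

The intermediate structured array is still a compact numeric array; only the final cast creates Python objects, which is where most of the time goes on the full-size problem.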
UPDATE (hat tip @hpaulj): there is a library function that does precisely the view casting we do manually: numpy.lib.recfunctions.unstructured_to_structured.
Using this we can write a more readable version of the above:
import numpy.lib.recfunctions as nlr
b = nlr.unstructured_to_structured(a).astype('O')
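As a quick sanity check, the following sketch compares the library route with the manual view on a small array. The variable names (small, via_library, via_view, same) are made up for this illustration.

```python
import numpy as np
import numpy.lib.recfunctions as nlr

small = np.arange(12, dtype=float).reshape(2, 3, 2)

# Library route: the last axis becomes the fields of a structured dtype.
via_library = nlr.unstructured_to_structured(small).astype(object)

# Manual route from the one-liner, for comparison.
manual_dtype = [(f"f{i}", small.dtype) for i in range(small.shape[-1])]
via_view = small.view(manual_dtype)[..., 0].astype(object)

# Compare element by element, since both arrays hold Python tuples.
same = all(
    via_library[idx] == via_view[idx] for idx in np.ndindex(via_library.shape)
)
print(via_library.shape, same)
```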
Upvotes: 10
Reputation: 26896
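If you are happy with list instead of tuple, this could be achieved with the following trick:
- convert the array to a list of lists using .tolist()
- change the size of one of the innermost lists (misalign)
- convert the list of lists back to a NumPy array
- revert the misalignment of the modified innermost list
This is implemented in the following function last_dim_as_list():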
import numpy as np
def last_dim_as_list(arr):
    if arr.ndim > 1:
        # : convert to list of lists
        arr_list = arr.tolist()
        # : misalign size of the first innermost list
        temp = arr_list
        for _ in range(arr.ndim - 1):
            temp = temp[0]
        temp.append(None)
        # : convert to NumPy array
        # (needs an explicit `object` dtype because of the misalignment;
        # recent NumPy versions raise an error for ragged input otherwise)
        result = np.array(arr_list, dtype=object)
        # : revert the misalignment
        temp.pop()
    else:
        result = np.empty(1, dtype=object)
        result[0] = arr.tolist()
    return result
np.random.seed(0)
in_arr = np.random.randint(0, 9, (2, 3, 2))
out_arr = last_dim_as_list(in_arr)
print(in_arr)
# [[[5 0]
#   [3 3]
#   [7 3]]
#  [[5 2]
#   [4 7]
#   [6 8]]]
print(in_arr.shape)
# (2, 3, 2)
print(in_arr.dtype)
# int64
print(out_arr)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr.shape)
# (2, 3)
print(out_arr.dtype)
# object
However, I would NOT recommend taking this route unless you really know what you are doing. Most of the time you are better off keeping everything as a NumPy array of higher dimensionality and making good use of NumPy indexing.
Note that this could also be done with explicit loops, but the proposed approach should be much faster for large enough inputs:
def last_dim_as_list_loop(arr):
    shape = arr.shape
    result = np.empty(arr.shape[:-1], dtype=object).ravel()
    for k in range(arr.shape[-1]):
        for i in range(result.size):
            if k == 0:
                result[i] = []
            result[i].append(arr[..., k].ravel()[i])
    return result.reshape(shape[:-1])
out_arr2 = last_dim_as_list_loop(in_arr)
print(out_arr2)
# [[list([5, 0]) list([3, 3]) list([7, 3])]
#  [list([5, 2]) list([4, 7]) list([6, 8])]]
print(out_arr2.shape)
# (2, 3)
print(out_arr2.dtype)
# object
But the timings for this last one are not exactly spectacular:
%timeit last_dim_as_list(in_arr)
# 2.53 µs ± 37.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit last_dim_as_list_loop(in_arr)
# 12.2 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The view-based approach proposed by @PaulPanzer is very elegant and more efficient than the trick proposed in last_dim_as_list() because it loops (internally) through the array only once as compared to twice:
def last_dim_as_tuple(arr):
    dtype = [(str(i), arr.dtype) for i in range(arr.shape[-1])]
    return arr.view(dtype)[..., 0].astype(object)
and therefore the timings on large enough inputs are more favorable:
in_arr = np.random.random((6602, 3176, 2))
%timeit last_dim_as_list(in_arr)
# 4.9 s ± 73.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit last_dim_as_tuple(in_arr)
# 3.07 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 0
Reputation: 662
A vectorized approach (it's a bit tricky):
mat = np.random.rand(6602, 3176, 2)
f = np.vectorize(lambda x:tuple(*x.items()), otypes=[np.ndarray])
mat2 = np.apply_along_axis(lambda x:dict([tuple(x)]), 2, mat)
mat2 = np.vstack(f(mat2))
mat2.shape
Out: (6602, 3176)
type(mat2[0,0])
Out: tuple
Upvotes: 0
Reputation: 2518
If you really want to do this, you have to set the dtype of your array to object. E.g., if you have the mentioned array:
a = np.random.rand(6602, 3176, 2)
You could create a second empty array with shape (6602, 3176) and set dtype to object:
b = np.empty(a[:,:,0].shape, dtype=object)
and fill your array with tuples.
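One possible way to do the filling is sketched below on a small array; the loop over np.ndindex is an assumption about how you might fill it, not something spelled out in this answer. It is a plain Python loop, so expect it to be slow on the full (6602, 3176, 2) array.

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(2, 3, 2)
b = np.empty(a.shape[:-1], dtype=object)

# Assign one element at a time so each tuple is stored as a single object.
for idx in np.ndindex(b.shape):
    b[idx] = tuple(a[idx].tolist())

print(b.shape, b[1, 2])
```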
But in the end there is no big advantage! I'd just use slicing to get the tuples from your initial array a. You can access the pair at indexes n (1st dimension) and m (2nd dimension) by slicing your 3d array along the third dimension:
a[n,m,:]
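A small sketch of this on-demand access, with illustrative values for n and m and a small stand-in array, could look like this:

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(2, 3, 2)
n, m = 1, 2

# Slicing returns a 1-D array of length 2 for position (n, m).
pair = a[n, m, :]

# Build an actual tuple only at the point where one is needed.
pair_as_tuple = tuple(pair.tolist())

# Whole-array operations keep working on the numeric array,
# for example the sum of each pair.
pair_sums = a.sum(axis=-1)

print(pair_as_tuple, pair_sums.shape)
```

This keeps the data in a single float array, so vectorized operations stay available and no object array has to be built.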
Upvotes: 2