Reputation: 323
I'm working on an ML project for which I'm using numpy arrays instead of pandas for faster computation.
For bootstrapping, I want to select a subset of columns from a numpy ndarray.
My numpy array looks like this:
np_arr =
[(187., 14.45, 20.22, 94.49)
 (284., 10.44, 15.46, 66.62)
 (415., 11.13, 22.44, 71.49)]
And I want to index columns 1,3.
I have my columns stored in a list as ix = [1,3]
However, when I try np_arr[:,ix] I get an error saying "too many indices for array".
I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3,4).
Could you please tell me how to fix this?
Thanks!
Edit:
I'm creating my numpy object from my pandas dataframe like this:
def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res
Upvotes: 0
Views: 262
Reputation: 231335
You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).
To illustrate:
In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [427]: df
Out[427]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
In [428]: arr = df.to_records()
In [429]: arr
Out[429]:
rec.array([(0, 0,  1,  2), (1, 3,  4,  5), (2, 6,  7,  8), (3, 9, 10, 11)],
          dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape
Out[431]: (4,)
to_records has an index parameter; passing index=False omits the index field.
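As a minimal sketch of that option (rebuilding the same example df so the snippet runs on its own; the variable name rec_no_index is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# index=False leaves the DataFrame index out of the record array
rec_no_index = df.to_records(index=False)

print(rec_no_index.dtype.names)  # field names only, no 'index' field
print(rec_no_index.shape)        # still 1-D: one record per row
```

The result is still a 1-D record array, so positional column indexing such as rec_no_index[:, ix] would fail in the same way.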
Or with your method:
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())
In [433]: arr
Out[433]:
rec.array([(0,  1,  2), (3,  4,  5), (6,  7,  8), (9, 10, 11)],
          dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A'] # arr.A also works
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape
Out[435]: (4,)
And multifield access:
In [436]: arr[['A','C']]
Out[436]:
rec.array([(0,  2), (3,  5), (6,  8), (9, 11)],
          dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
Note that the str display of this array
In [437]: print(arr)
[(0, 1, 2) (3, 4, 5) (6, 7, 8) (9, 10, 11)]
shows a list of tuples, just like your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.
You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. Named/record access makes the most sense when the columns are a mix of dtypes (string, int, float). If they are all float and you want to do calculations across columns, it's better to use a plain numeric dtype.
In [438]: arr = df.to_numpy()
In [439]: arr
Out[439]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
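With a plain 2-D array, positional column selection works the way you expected. Here is a minimal sketch, not a drop-in replacement for your method: the ix values, the field names f0..f3, and the variable names rec, plain, and cols are assumptions made up for illustration. If you only have a structured array and no DataFrame, numpy.lib.recfunctions.structured_to_unstructured (NumPy 1.16+) can convert it, as long as the fields share a compatible dtype.

```python
import numpy as np
import pandas as pd
from numpy.lib import recfunctions as rfn

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# Route 1: go straight from the DataFrame to a regular 2-D array
arr = df.to_numpy()
ix = [0, 2]              # hypothetical column positions
subset = arr[:, ix]      # shape (4, 2)

# Route 2: convert an existing structured array with all-float fields
rec = np.array(
    [(187., 14.45, 20.22, 94.49),
     (284., 10.44, 15.46, 66.62),
     (415., 11.13, 22.44, 71.49)],
    dtype=[('f0', 'f8'), ('f1', 'f8'), ('f2', 'f8'), ('f3', 'f8')],
)
plain = rfn.structured_to_unstructured(rec)  # shape (3, 4), dtype float64
cols = plain[:, [1, 3]]                      # columns 1 and 3
```

Route 1 is the simpler choice when the DataFrame is still available, since it never creates the structured array in the first place.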
Upvotes: 0
Reputation: 13
The issue is that your np_arr is a 1-D array. Please share the code that creates it so the exact cause can be identified. In general, a 2-D numpy array is created like this:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
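On an array like this, indexing with a list of column positions works. A small sketch using the ix list from the question (the name selected is just for illustration):

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ix = [1, 3]

print(a.shape)        # (3, 4)

# All rows, columns 1 and 3
selected = a[:, ix]
print(selected)
```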
Upvotes: 1