Numpy unable to access columns

Question

I'm working on an ML project for which I'm using numpy arrays instead of pandas for faster computation.

For bootstrapping, I need to select a subset of columns from a numpy ndarray.

My numpy array looks like this:

np_arr =   
[(187., 14.45 , 20.22, 94.49)
(284., 10.44 , 15.46, 66.62)
(415., 11.13 , 22.44, 71.49)]

I want to index columns 1 and 3.

I have my column indices stored in a list, ix = [1,3].

However, when I try np_arr[:,ix] I get an error saying "too many indices for array".

I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3,4).

Could you please tell me how to fix this?

Thanks!

Edit:

I'm creating my numpy object from my pandas dataframe like this:

def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res

hpaulj · Accepted Answer

You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).

To illustrate:

In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])                 
In [427]: df                                                                                   
Out[427]: 
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
In [428]: arr = df.to_records()                                                                
In [429]: arr                                                                                  
Out[429]: 
rec.array([(0, 0,  1,  2), (1, 3,  4,  5), (2, 6,  7,  8), (3, 9, 10, 11)],
          dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

to_records takes an index parameter; passing index=False leaves out the index field.
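As a minimal sketch of dropping the index field, assuming the same small frame as above (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# index=False keeps the DataFrame index out of the record array
arr = df.to_records(index=False)

print(arr.dtype.names)  # ('A', 'B', 'C')
print(arr.shape)        # (4,)
print(arr["B"])         # [ 1  4  7 10]
```

The result is still a 1d record array, so fields are selected by name, not by column number.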

Or with your method:

In [432]:                                                                                      
In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())                              
In [433]: arr                                                                                  
Out[433]: 
rec.array([(0,  1,  2), (3,  4,  5), (6,  7,  8), (9, 10, 11)],
          dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])

And multifield access:

In [436]: arr[['A','C']]                                                                       
Out[436]: 
rec.array([(0,  2), (3,  5), (6,  8), (9, 11)],
          dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})

Note that the str display of this array

In [437]: print(arr)                                                                           
[(0,  1,  2) (3,  4,  5) (6,  7,  8) (9, 10, 11)]


shows a list of tuples, just like your np_arr.  Each tuple is a 'record'.  The repr display shows the dtype as well.
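This is also why np_arr[:,ix] fails in the question. A small sketch, using the illustrative frame from above and not your actual data, shows that a record array has a single dimension, so a second index has nothing to index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])
arr = df.to_records(index=False)

print(arr.shape)  # (4,)
print(arr.ndim)   # 1

# A 1d array accepts only one index, so [:, [0, 2]] raises IndexError.
failed = False
try:
    arr[:, [0, 2]]
except IndexError as exc:
    failed = True
    print("IndexError:", exc)
```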

You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number.  The named/record access makes most sense when the columns are a mix of dtypes - string, int, float.  If they are all float, and you want to do calculations across columns, it's better to use a numeric dtype.

In [438]: arr = df.to_numpy()                                                                  
In [439]: arr                                                                                  
Out[439]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
