Shubhankar Agrawal

Reputation: 323

Numpy unable to access columns

I'm working on an ML project for which I'm using numpy arrays instead of pandas for faster computation.

When I intend to bootstrap, I wish to subset the columns from a numpy ndarray.

My numpy array looks like this:

np_arr =   
[(187., 14.45 , 20.22, 94.49)
(284., 10.44 , 15.46, 66.62)
(415., 11.13 , 22.44, 71.49)]

And I want to index columns 1,3.

I have my columns stored in a list as ix = [1,3]

However, when I try np_arr[:,ix] I get an error saying "too many indices for array".

I also realised that when I print np_arr.shape I only get (3,), whereas I probably want (3,4).

Could you please tell me how to fix this?

Thanks!

Edit:

I'm creating my numpy object from my pandas dataframe like this:

def _to_numpy(self, data):
    v = data.reset_index()
    np_res = np.rec.fromrecords(v, names=v.columns.tolist())
    return np_res

Upvotes: 0

Views: 262

Answers (2)

hpaulj

Reputation: 231335

You have created a record array (also called a structured array). The result is a 1d array with named columns (fields).

To illustrate:

In [426]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])                 
In [427]: df                                                                                   
Out[427]: 
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
In [428]: arr = df.to_records()                                                                
In [429]: arr                                                                                  
Out[429]: 
rec.array([(0, 0,  1,  2), (1, 3,  4,  5), (2, 6,  7,  8), (3, 9, 10, 11)],
          dtype=[('index', '<i8'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [430]: arr['A']                                                                             
Out[430]: array([0, 3, 6, 9])
In [431]: arr.shape                                                                            
Out[431]: (4,)

to_records accepts index=False, which leaves out the index field.
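As a minimal sketch of dropping the index field, the snippet below rebuilds the same df as above and passes index=False to to_records. The variable names arr_no_index and field_names are only illustrative. The result is still a 1d record array, so it does not by itself make positional column indexing work.

```python
import numpy as np
import pandas as pd

# Same frame as in the session above
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# index=False keeps the DataFrame index out of the record array
arr_no_index = df.to_records(index=False)

field_names = arr_no_index.dtype.names
print(field_names)         # only the column names, no 'index' field
print(arr_no_index.shape)  # still 1d: one record per row
```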

Or with your method:

In [432]: arr = np.rec.fromrecords(df, names=df.columns.tolist())                              
In [433]: arr                                                                                  
Out[433]: 
rec.array([(0,  1,  2), (3,  4,  5), (6,  7,  8), (9, 10, 11)],
          dtype=[('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
In [434]: arr['A']            # arr.A also works                                                                 
Out[434]: array([0, 3, 6, 9])
In [435]: arr.shape                                                                            
Out[435]: (4,)

And multifield access:

In [436]: arr[['A','C']]                                                                       
Out[436]: 
rec.array([(0,  2), (3,  5), (6,  8), (9, 11)],
          dtype={'names':['A','C'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
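If the column positions are stored in a list, as with ix in the question, one possible approach is to translate the positions into field names first and then use multifield access. This is only a sketch: the record array is rebuilt here from a list of tuples so that the snippet runs on its own, and selected_names and subset are illustrative names.

```python
import numpy as np

# Record array with the same fields as arr above
arr = np.rec.fromrecords(
    [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)],
    names=['A', 'B', 'C'],
)

ix = [0, 2]  # positions of the wanted fields

# Map positions to field names, then index with the list of names
selected_names = [arr.dtype.names[i] for i in ix]
subset = arr[selected_names]

print(selected_names)
print(subset.shape)  # still 1d; each record now has two fields
```

The result is still a record array, so calculations across columns would need a further conversion to a regular 2d array.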

Note that the str display of this array

In [437]: print(arr)                                                                           
[(0,  1,  2) (3,  4,  5) (6,  7,  8) (9, 10, 11)]

shows a list of tuples, just as your np_arr. Each tuple is a 'record'. The repr display shows the dtype as well.

You can't have it both ways: either access columns by name, or make a regular numpy array and access columns by number. Named/record access makes the most sense when the columns are a mix of dtypes (string, int, float). If they are all float and you want to do calculations across columns, it's better to use a regular array with a numeric dtype.

In [438]: arr = df.to_numpy()                                                                  
In [439]: arr                                                                                  
Out[439]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
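Applied to the question, a sketch of the conversion could look like the following. The function name to_plain_numpy is hypothetical; it stands in for the _to_numpy method and assumes a pandas version that provides DataFrame.to_numpy. Because reset_index adds the old index as column 0, ix = [1, 3] selects columns A and C here.

```python
import numpy as np
import pandas as pd

def to_plain_numpy(data):
    """Hypothetical stand-in for _to_numpy: returns a regular 2d ndarray."""
    return data.reset_index().to_numpy()

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])
arr2d = to_plain_numpy(df)

ix = [1, 3]            # column positions, as in the question
subset = arr2d[:, ix]  # positional indexing works on a 2d array

print(arr2d.shape)     # rows x (index column + 3 data columns)
print(subset)
```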

Upvotes: 0

Neelansh Sahai

Reputation: 13

Your issue arises because the np_arr you have is a 1-D array. Please share the code that creates it so the exact cause can be identified. In general, a 2-D numpy array is created from a list of rows. Here is a small example:

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
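To connect this with the question, a short sketch of selecting columns 1 and 3 from such a 2-D array with a list of positions follows. The name cols is only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

ix = [1, 3]     # column positions to keep
cols = a[:, ix] # all rows, only the listed columns

print(a.shape)
print(cols)
```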

Upvotes: 1
