pir

Reputation: 5923

Make single column of numpy array another datatype

Given a NumPy array my_arr filled with strings, how do I set the datatype of one of the columns to float? I need the result to be a NumPy array so that I can keep using it with my existing code. See my failed attempt below:

import numpy as np

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]
my_arr = np.array(dat)
print(my_arr)
# [['User1' 'Male' '2.2'], ['User2' 'Female' '3.777'], ['User3' 'Unknown' '0.0']]

my_arr[:,2] = my_arr[:,2].astype(float)  # np.float was only an alias for the builtin float
print(my_arr)
# [['User1' 'Male' '2.2'], ['User2' 'Female' '3.777'], ['User3' 'Unknown' '0.0']]

Upvotes: 4

Views: 1963

Answers (2)

Cleb

Reputation: 25997

There might be smarter ways of doing this, but I think the following gives you the correct output. You can use a structured array:

import numpy as np
dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]

# create data types: two byte strings of length 10 and a float
# ('S10' is the current spelling of the older 'a10' alias)
dt = np.dtype('S10, S10, float')

# convert the inner lists to tuples so that a structured array can be used
dat = [tuple(row) for row in dat]

# convert dat to an array
my_arr = np.array(dat, dt)

Output:

array([('User1', 'Male', 2.2), ('User2', 'Female', 3.777),
       ('User3', 'Unknown', 0.0)], 
      dtype=[('f0', 'S10'), ('f1', 'S10'), ('f2', '<f8')])

You can also give names to the columns by doing:

dt = {'names': ['user', 'gender', 'number'], 'formats': ['S10', 'S10', 'float']}
my_arr = np.array(dat, dt)  # dat is the list with tuples, see above

The output now is:

array([('User1', 'Male', 2.2), ('User2', 'Female', 3.777),
       ('User3', 'Unknown', 0.0)], 
      dtype=[('user', 'S10'), ('gender', 'S10'), ('number', '<f8')])

You can then access a single column by its name, e.g.:

my_arr['number']
array([ 2.2  ,  3.777,  0.   ])

my_arr['user']
array(['User1', 'User2', 'User3'], dtype='|S10')
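Once the numbers sit in a float field, that field behaves like an ordinary float array. Here is a minimal sketch of this, assuming Python 3 (so 'U10' unicode fields are used instead of byte strings) and an explicit float() conversion of the third item. The names rows, numbers and users_above_one are only illustrative:

```python
import numpy as np

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]

# Assumption: Python 3, so unicode 'U10' fields are used rather than 'S10' byte strings.
dt = np.dtype([('user', 'U10'), ('gender', 'U10'), ('number', 'f8')])

# Convert the numeric text explicitly so the float field receives real floats.
rows = [(user, gender, float(number)) for user, gender, number in dat]
my_arr = np.array(rows, dtype=dt)

numbers = my_arr['number']           # float64 array backed by the 'number' field
total = numbers.sum()                # arithmetic works as on any float array
users_above_one = my_arr['user'][numbers > 1.0]  # boolean mask across fields
```

The string fields stay untouched, so the array still carries all three columns while only 'number' is numeric.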

I would also recommend using a pandas DataFrame, which makes it easy to work with mixed data types and more complex data structures.

For your example:

import pandas as pd
pd.DataFrame(dat, columns=['user', 'gender', 'some number'])

would then simply give you:

    user   gender some number
0  User1     Male         2.2
1  User2   Female       3.777
2  User3  Unknown         0.0
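Note that a DataFrame built from dat still holds the numbers as text, because every item in dat is a string. A small sketch of converting that one column and getting a NumPy array back out, assuming the same column names as above (the variable names df and numbers are just for illustration):

```python
import pandas as pd

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]
df = pd.DataFrame(dat, columns=['user', 'gender', 'some number'])

# Convert only the numeric column; the other columns keep their text values.
df['some number'] = df['some number'].astype(float)

# Pull the converted column back out as a plain NumPy float array.
numbers = df['some number'].to_numpy()
```

Each DataFrame column has its own dtype, which is why the per-column conversion sticks here while it does not in a plain 2-D string array.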

Upvotes: 2

hpaulj

Reputation: 231385

You could convert your 2d array into a structured array with a mixed dtype:

In [137]: my_arr
Out[137]: 
array([['User1', 'Male', '2.2'],
       ['User2', 'Female', '3.777'],
       ['User3', 'Unknown', '0.0']], 
      dtype='<U7')

In [138]: dt=np.dtype('U7,U7,f')  # compound dtype: two 7-character strings and a float32

In [139]: np.array([tuple(row) for row in my_arr], dtype=dt)
Out[139]: 
array([('User1', 'Male', 2.200000047683716),
       ('User2', 'Female', 3.7769999504089355), ('User3', 'Unknown', 0.0)], 
      dtype=[('f0', '<U7'), ('f1', '<U7'), ('f2', '<f4')])

In [140]: _.shape
Out[140]: (3,)

Now it is a 1d array with 3 fields. Instead of accessing columns by number you access fields by name, arr['f0'] etc.

I used [tuple(row) for row in my_arr] because the input to structured arrays has to be a list of tuples. I could have used your dat list, [tuple(row) for row in dat].
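As a rough sketch of the dat variant, here is the same conversion starting from the list, assuming Python 3. It uses 'f8' (float64) in place of 'f' (float32), which is my substitution rather than part of the session above; it avoids the rounding visible in Out[139]:

```python
import numpy as np

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]

# Two 7-character unicode fields and one float64 field (auto-named f0, f1, f2).
dt = np.dtype('U7,U7,f8')

# Each row must be a tuple so NumPy treats it as one record.
arr = np.array([tuple(row) for row in dat], dtype=dt)

first_record = arr[0]     # one record holding all three fields
numbers = arr['f2']       # the float field, selected by name
```

Here arr.shape is (3,), so indexing such as arr[:, 2] no longer applies; arr['f2'] takes its place.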

Upvotes: 3
