Mauro Gentile

Reputation: 1511

Series of lists to multidimensional NumPy array

I have a pandas DataFrame df. One column holds strings of numbers (as characters) separated by spaces.

I need to convert it to a multidimensional NumPy array.

I thought that:

df.A.apply(lambda x: np.array(x.split(" "))).values

would do the trick.

Actually, it returns an array of arrays:

array([array(['70', '80', '82', ..., '106', '109', '82'], dtype='<U3'),
   array(['151', '150', '147', ..., '193', '183', '184'], dtype='<U3'),

This does not seem to be what I am looking for, which should rather look like:

array([['70', '80', '82', ..., '106', '109', '82'],['151', '150', '147', ..., '193', '183', '184']....

First: what should I do to get my data in the second format? Second: I am a bit confused about the difference between the two data structures. At the end of the day, a multidimensional array is an array of arrays. From this perspective, the two would seem to be the same structure. But I am sure I am missing something.

EXAMPLE:

df=pd.DataFrame({"A":[0,1,2,3],"B":["1 2 3 4","5 6 7 8","9 10 11 12","13 14 15 16"]})

    A   B
0   0   "1 2 3 4"
1   1   "5 6 7 8"
2   2   "9 10 11 12"
3   3   "13 14 15 16"

This command

df.B.apply(lambda x: np.array(x.split(" "))).values

gives:

array([array(['1', '2', '3', '4'], dtype='<U1'),
   array(['5', '6', '7', '8'], dtype='<U1'),
   array(['9', '10', '11', '12'], dtype='<U2'),
   array(['13', '14', '15', '16'], dtype='<U2')], dtype=object)

instead of

 array([['1', '2', '3', '4'],
   ['5', '6', '7', '8'],
   ['9', '10', '11', '12'],
   ['13', '14', '15', '16']], dtype='<U2')

Question 1: How do I get this last structure? Question 2: What is the difference between the two? Technically, both are arrays of arrays...

Upvotes: 2

Views: 154

Answers (1)

Ben.T

Reputation: 29635

You can do it by calling str.split on df.A directly with the parameter expand=True and then taking values:

import pandas as pd

df = pd.DataFrame({'A': ['70 80 82', '151 150 147']})
# expand=True returns a DataFrame with one column per token
df.A.str.split(' ', expand=True).values
array([['70', '80', '82'],
       ['151', '150', '147']], dtype=object)

With your method, if all the strings contain the same number of values, you can still use np.stack to get a 2D array with the same contents:

print (np.stack(df.A.apply(lambda x: np.array(x.split(" "))).values))
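As a minimal sketch, both approaches can be checked against the example frame from the question (column B). The names via_expand and via_stack are illustrative only, and the sketch assumes every row has the same number of tokens.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3],
                   "B": ["1 2 3 4", "5 6 7 8", "9 10 11 12", "13 14 15 16"]})

# Approach 1: split into a DataFrame (one column per token), then take the values.
via_expand = df.B.str.split(" ", expand=True).values

# Approach 2: build one 1D array per row, then stack them along a new first axis.
# np.stack needs every row to have the same length.
via_stack = np.stack(df.B.apply(lambda x: np.array(x.split(" "))).values)

print(via_expand.shape)   # (4, 4)
print(via_stack.shape)    # (4, 4)
print(via_stack[2, 1])    # 10
```

Both results can be indexed as ordinary 2D arrays; they may differ in dtype, since np.stack produces a fixed-width string dtype while the DataFrame route does not.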

EDIT: As for the difference, I am not sure I can explain it well enough, but I will try. Let's define:

arr1 = df.A.str.split(' ',expand=True).values
arr2 = df.A.apply(lambda x: np.array(x.split(" "))).values

First you can notice that the shape is not the same:

print(arr1.shape)
(2, 3)
print(arr2.shape)
(2,)

So I would say one difference is that arr2 is a 1D array whose elements happen to be 1D arrays themselves. When you build arr2 with values, it constructs a 1D array from the Series df.A.apply(lambda x: np.array(x.split(" "))) without looking at the type of the objects stored in that Series. For arr1, the difference is that df.A.str.split(' ',expand=True) is not a Series but a DataFrame, so values constructs a 2D array with shape (number of rows, number of columns). In both cases you use values, but having an array in each cell of a Series (as in your method) does not create a 2D array.
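The distinction can be inspected directly. The following is a small sketch using the same two-row frame as above; it only prints properties of arr1 and arr2 and is not meant as a complete explanation of NumPy object arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["70 80 82", "151 150 147"]})

arr1 = df.A.str.split(" ", expand=True).values               # built from a DataFrame
arr2 = df.A.apply(lambda x: np.array(x.split(" "))).values   # built from a Series

# arr1 is one 2D block: NumPy knows about both axes.
print(arr1.ndim, arr1.shape)               # 2 (2, 3)

# arr2 is a 1D container whose cells hold independent arrays.
print(arr2.ndim, arr2.shape, arr2.dtype)   # 1 (2,) object
print(type(arr2[0]), arr2[0].shape)        # <class 'numpy.ndarray'> (3,)

# Whole-array operations see two axes in arr1 but only the outer one in arr2.
print(arr1.T.shape)   # (3, 2)
print(arr2.T.shape)   # (2,)
```

Because arr2 only stores references to separate arrays, nothing forces those inner arrays to have the same length, which is why NumPy cannot treat it as a rectangular 2D array.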

Then, if you want to access a single element (such as the second element of the first row), you can write arr1[0,1], while arr2[0,1] raises an IndexError because arr2 is not a 2D array. arr2[0][1] gives the right answer, because you access the second element [1] of the first 1D array [0] in arr2.

I hope this helps explain it.
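A short sketch of the indexing behaviour described here, again on the two-row frame. The raised flag is only there to make the outcome visible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["70 80 82", "151 150 147"]})
arr1 = df.A.str.split(" ", expand=True).values
arr2 = df.A.apply(lambda x: np.array(x.split(" "))).values

# 2D indexing works on arr1, including column slices.
print(arr1[0, 1])        # 80
print(list(arr1[:, 1]))  # second column across all rows

# arr2 needs two separate indexing steps: first the row, then the element.
print(arr2[0][1])        # 80

# A two-axis index on the 1D object array fails.
raised = False
try:
    arr2[0, 1]
except IndexError as exc:
    raised = True
    print("IndexError:", exc)
```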

Upvotes: 3
