Reputation: 1511
I have a pandas DataFrame df. One column holds strings of numbers (as characters) separated by spaces.
I need to convert it to a multidimensional NumPy array.
I thought that:
df.A.apply(lambda x: np.array(x.split(" "))).values
would do the trick.
Actually, it returns an array of arrays:
array([array(['70', '80', '82', ..., '106', '109', '82'], dtype='<U3'),
array(['151', '150', '147', ..., '193', '183', '184'], dtype='<U3'),
This does not seem to be what I am looking for, which should rather look like:
array([[[['70', '80', '82', ..., '106', '109', '82'],['151', '150', '147', ..., '193', '183', '184']....
First: what should I do to get my data in the second format? Second: I am actually a bit confused about the difference between the two data structures. At the end of the day, a multidimensional array is an array of arrays. From this perspective, the two would seem to be the same structure. But I am sure I am missing something.
EXAMPLE:
df=pd.DataFrame({"A":[0,1,2,3],"B":["1 2 3 4","5 6 7 8","9 10 11 12","13 14 15 16"]})
A B
0 0 "1 2 3 4"
1 1 "5 6 7 8"
2 2 "9 10 11 12"
3 3 "13 14 15 16"
This command
df.B.apply(lambda x: np.array(x.split(" "))).values
gives:
array([array(['1', '2', '3', '4'], dtype='<U1'),
array(['5', '6', '7', '8'], dtype='<U1'),
array(['9', '10', '11', '12'], dtype='<U2'),
array(['13', '14', '15', '16'], dtype='<U2')], dtype=object)
instead of
array([['1', '2', '3', '4'],
['5', '6', '7', '8'],
['9', '10', '11', '12'],
['13', '14', '15', '16']], dtype='<U2')
Question 1: How do I get this last structure? Question 2: What is the difference between the two? Technically, both are arrays of arrays...
Upvotes: 2
Views: 154
Reputation: 29635
You can do it using str.split on df.A directly, with the parameter expand=True, and then use values, such as:
df = pd.DataFrame({'A':['70 80 82','151 150 147']})
print (df.A.str.split(' ',expand=True).values)
array([['70', '80', '82'],
['151', '150', '147']], dtype=object)
With your method, if all the strings contain the same number of values, you can still use np.stack to get the same result:
print (np.stack(df.A.apply(lambda x: np.array(x.split(" "))).values))
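As a minimal sketch, here are both approaches applied to the example DataFrame from the question (column B). The variable names via_split, via_stack and as_int are only illustrative, and the final integer conversion is an optional extra step that the question did not ask for.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3],
                   "B": ["1 2 3 4", "5 6 7 8", "9 10 11 12", "13 14 15 16"]})

# Approach 1: split each string into separate columns (a DataFrame),
# then take the underlying 2D array.
via_split = df["B"].str.split(" ", expand=True).values

# Approach 2: keep the apply-based method, then stack the per-row
# 1D arrays into a single 2D array (all rows must have the same length).
via_stack = np.stack(df["B"].apply(lambda s: np.array(s.split(" "))).values)

print(via_split.shape)  # (4, 4)
print(via_stack.shape)  # (4, 4)

# Optional: convert the string digits to integers.
as_int = via_split.astype(int)
print(as_int)
```

Both results are genuine 2D arrays of shape (4, 4). The only difference is the dtype: the str.split version holds Python strings in an object array, while np.stack produces a fixed-width unicode array.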
EDIT: As for the difference, I am not sure I can explain it well enough, but I will try. Let's define:
arr1 = df.A.str.split(' ',expand=True).values
arr2 = df.A.apply(lambda x: np.array(x.split(" "))).values
First, you can notice that the shapes are not the same:
print(arr1.shape)
(2, 3)
print(arr2.shape)
(2,)
So I would say one difference is that arr2 is a 1D array (with dtype=object) whose elements happen to be 1D arrays themselves. When you build arr2 with values, it constructs a 1D array from the Series df.A.apply(lambda x: np.array(x.split(" "))) without looking at the type of what each cell contains. For arr1, the difference is that df.A.str.split(' ',expand=True) is not a Series but a DataFrame, so values constructs a 2D array of shape (number of rows, number of columns). In both cases you use values, but having an array in each cell of a Series (as created by your method) does not produce a 2D array.
Then, if you want to access any element (such as the second element of the first row), you can use arr1[0,1], while arr2[0,1] will raise an error because that structure is not a 2D array. However, arr2[0][1] gives the right answer, because you access the second element [1] of the first 1D array [0] in arr2.
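To make the difference concrete, here is a small sketch that rebuilds arr1 and arr2 as defined above and compares how they behave. The names two_index_ok and second_column are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["70 80 82", "151 150 147"]})

arr1 = df["A"].str.split(" ", expand=True).values                 # true 2D array
arr2 = df["A"].apply(lambda s: np.array(s.split(" "))).values    # 1D array of arrays

print(arr1.shape)  # (2, 3)
print(arr2.shape)  # (2,)

# Each element of arr2 is itself a separate 1D array.
print(type(arr2[0]))

# Two-index access works on arr1; on arr2 it raises IndexError,
# because arr2 has only one dimension.
print(arr1[0, 1])
try:
    arr2[0, 1]
    two_index_ok = True
except IndexError:
    two_index_ok = False
print(two_index_ok)

# Chained indexing works on arr2: first pick the row, then the element.
print(arr2[0][1])

# Column slicing is only possible on the true 2D array.
second_column = arr1[:, 1]
print(second_column)
```

In practice this means that operations that rely on the array's shape (slicing a column, reshaping, transposing) only work on arr1, while arr2 has to be handled one inner array at a time.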
I hope this helps explain it.
Upvotes: 3