Reputation: 463
I have a pandas DataFrame with a column of vectors that I would like to perform matrix arithmetic on. However, on closer inspection, each vector is stored as a string, seemingly with newline characters embedded in it.
How do I convert each vector in this column into a NumPy array? I've tried
df['Word Vector'].as_matrix
and
np.array(df['Word Vector'])
as well as
df['Word Vector'] = df['Word Vector'].astype(np.array)
but none produced the desired result. Any pointers would be appreciated!
Upvotes: 6
Views: 15480
Reputation: 7343
This worked for me for string lists in a Pandas column:
df['Numpy Word Vector'] = df['Word Vector'].apply(eval).apply(np.array)
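Since eval executes whatever code the string contains, a more cautious variant is sketched below using ast.literal_eval from the standard library, which only accepts Python literals. It assumes the strings are comma-separated list literals such as '[0., 1., 2., 3.]'. The sample data and the helper name parse_vector are made up for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample data: each cell is a list literal stored as a string.
df = pd.DataFrame({'Word Vector': ['[0., 1., 2., 3.]', '[1., 2., 3., 4.]']})

def parse_vector(text):
    """Parse a list literal safely and return it as a float array."""
    return np.array(ast.literal_eval(text), dtype=float)

df['Numpy Word Vector'] = df['Word Vector'].apply(parse_vector)
```

This will not parse strings without commas (for example the output of str() on a NumPy array); those need a whitespace-based approach instead.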
Upvotes: 2
Reputation: 8585
The same conversion can be done in a single apply call:
df[col_name] = df[col_name].apply(lambda x: np.array(eval(x)))
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(['[0., 1., 2., 3.]', '[1., 2., 3., 4.]'], columns=['Word Vector'])
df['Word Vector'][0] # '[0., 1., 2., 3.]'
df['Word Vector'] = df['Word Vector'].apply(lambda x: np.array(eval(x)))
df['Word Vector'][0] # array([0., 1., 2., 3.])
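For the matrix arithmetic mentioned in the question, a column of arrays is usually easier to work with once it is stacked into one 2-D array. The sketch below assumes every vector has the same length. The names matrix and gram are illustrative, and ast.literal_eval stands in for eval.

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame(['[0., 1., 2., 3.]', '[1., 2., 3., 4.]'], columns=['Word Vector'])
df['Word Vector'] = df['Word Vector'].apply(lambda x: np.array(ast.literal_eval(x)))

# Stack the per-row arrays into a single (n_rows, n_dims) matrix.
matrix = np.stack(df['Word Vector'].tolist())

# Example of matrix arithmetic: pairwise dot products between all rows.
gram = matrix @ matrix.T
```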
Upvotes: 0
Reputation: 637
I hope the following works as you expected:
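import pandas as pd
import numpy as np

# str() of a long array contains embedded newlines, like the data in the question
x = str(np.arange(1, 100))
df = pd.DataFrame([x, x, x, x])
df.columns = ['words']
print('sample')
print(df.head())
# Strip the newlines and brackets, then let NumPy parse the numbers
result = df['words'].apply(lambda x:
    np.fromstring(
        x.replace('\n', '')
         .replace('[', '')
         .replace(']', '')
         .replace('  ', ' '), sep=' '))
print('result')
print(result)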
The output is as follows:
sample
words
0 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
1 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
2 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
3 [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
result
0 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
1 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
2 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
3 [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, ...
Calling replace this many times is not elegant, but I did not find a better approach. Either way, it should help you convert the strings to vectors.
As a side note, since the data was only shown in a picture, check whether the values are separated by spaces or tabs. If they are tab-separated, change sep=' ' to sep='\t'.
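One possible way to avoid the chain of replace calls is sketched below. It assumes each string is a single bracketed, whitespace-separated vector like the sample above. str.split() with no arguments splits on any run of whitespace, so spaces, tabs and newlines are all handled the same way. The helper name parse_vector is made up for this example.

```python
import numpy as np
import pandas as pd

# Same kind of sample data as above: str() of a long array, with embedded newlines.
x = str(np.arange(1, 100))
df = pd.DataFrame({'words': [x, x]})

def parse_vector(text):
    """Drop the outer brackets, split on whitespace, convert to floats."""
    return np.array(text.strip('[]').split(), dtype=float)

result = df['words'].apply(parse_vector)
```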
Upvotes: 15