Reputation: 877
I encountered a data error while trying to reduce my high-dimensional vectors to 2 dimensions using PCA.
This is my input data; each row has 300 dimensions:
vector
0 [0.01053525, -0.007869658, 0.0024931028, -0.04...
1 [-0.024436072, -0.016484523, 0.03859031, 0.000...
2 [0.015011676, -0.020465894, 0.004854744, -0.00...
3 [-0.010836455, -0.006562917, 0.00265073, 0.022...
4 [-0.018123362, -0.026007563, 0.04781856, -0.03...
... ...
45124 [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125 [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126 [-0.021875003, -0.005635035, 0.0076896898, -0....
45127 [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128 [0.007794927, 0.0019561667, 0.15995999, -0.054...
[45129 rows x 1 columns]
My Code:
data = pd.read_parquet('1.parquet', engine='fastparquet')
reduced = pca.fit_transform(data)
Error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.
Edit
>>data.shape
(45129, 1)
>>data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vector 45129 non-null object
dtypes: object(1)
memory usage: 352.7+ KB
Upvotes: 0
Views: 394
Reputation: 4224
Scikit-learn can't handle a column whose cells each contain an array (list). It expects a 2-D numeric array with one feature per column, so you'll need to expand the column. Since every row holds an array of the same size, this is straightforward, and with only about 45,000 rows it is cheap. Once the data is expanded, PCA should work.
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
      0     1     2
0  0.01  0.02  0.03
1  0.04  0.40  0.10
pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01, 1.43048962e-17],
[-1.93778224e-01, 1.43048962e-17]])
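Applied to a frame shaped like the one in the question, the same idea could look like the sketch below. The column name `vector` comes from the question's `data.info()` output; the random data is only a stand-in for the parquet file, and building a NumPy array instead of an intermediate DataFrame is a choice of convenience, not a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's data (assumption): a single object
# column named "vector" whose cells are equal-length lists of 300 floats.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"vector": [rng.normal(size=300).tolist() for _ in range(100)]}
)

# Turn the column of lists into one 2-D float array of shape (n_rows, 300).
X = np.array(data["vector"].tolist(), dtype=float)

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(X.shape, reduced.shape)
```

If the rows turned out to have different lengths, the `np.array(..., dtype=float)` call would raise an error instead of silently producing an object array, which makes that case easy to spot.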
Upvotes: 1