xzk

Reputation: 877

sklearn/PCA - Error while trying to transform high-dimensional data

I encountered an error while trying to reduce my high-dimensional vectors to 2 dimensions using PCA.

This is my input data; each row holds a 300-dimensional vector:

                                                  vector
0      [0.01053525, -0.007869658, 0.0024931028, -0.04...
1      [-0.024436072, -0.016484523, 0.03859031, 0.000...
2      [0.015011676, -0.020465894, 0.004854744, -0.00...
3      [-0.010836455, -0.006562917, 0.00265073, 0.022...
4      [-0.018123362, -0.026007563, 0.04781856, -0.03...
...                                                  ...
45124  [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125  [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126  [-0.021875003, -0.005635035, 0.0076896898, -0....
45127  [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128  [0.007794927, 0.0019561667, 0.15995999, -0.054...

[45129 rows x 1 columns]

My Code:

import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_parquet('1.parquet', engine='fastparquet')
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

Error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.

Edit

>>> data.shape
(45129, 1)
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   vector  45129 non-null  object
dtypes: object(1)
memory usage: 352.7+ KB
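For what it's worth, the object dtype is the telltale: each cell of vector holds a Python list, so the frame cannot be coerced into the 2-D float array scikit-learn expects. A minimal sketch (toy two-element lists, not the real 300-dimensional vectors) reproduces the same error chain:

import numpy as np
import pandas as pd

df = pd.DataFrame({"vector": [[0.1, 0.2], [0.3, 0.4]]})
print(df.dtypes)  # vector    object

# scikit-learn effectively does this coercion internally; with list-valued
# cells it raises "ValueError: setting an array element with a sequence."
np.asarray(df, dtype=np.float64)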


Upvotes: 0

Views: 394

Answers (1)

Nick Becker

Reputation: 4224

Scikit-learn doesn't know how to handle a column whose cells contain arrays (lists), so you'll need to expand that column into one numeric column per element. Since every row's array has the same length, this is cheap at only 45,000 rows. Once the data is expanded, PCA should work fine.

import pandas as pd
from sklearn.decomposition import PCA

# Toy frame with one list-valued column, like your `vector` column
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})

# Expand each list into its own numeric column
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
#       0     1     2
# 0  0.01  0.02  0.03
# 1  0.04  0.40  0.10

pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
# array([[ 1.93778224e-01,  1.43048962e-17],
#        [-1.93778224e-01,  1.43048962e-17]])
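
Applied to the frame in your question (assuming the list column is named vector, as your printout suggests), the same pattern would look roughly like this:

import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_parquet('1.parquet', engine='fastparquet')

# Expand the list-valued column into 300 numeric columns
expanded = pd.DataFrame(data["vector"].tolist(), index=data.index)

pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded)  # -> shape (45129, 2)

Passing index=data.index keeps the row labels aligned, which makes it easy to join the 2-D coordinates back onto the original frame later.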

Upvotes: 1
