Jason

Reputation: 47

TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray' whilst trying to do PCA

I'm trying to do PCA on a sparse matrix, but I am encountering an error:

TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

Here is my code:

import sys
import csv
from sklearn.decomposition import PCA

data_sentiment = []
y = []
data2 = []
csv.field_size_limit(sys.maxint)
with open('/Users/jasondou/Google Drive/data/competition_1/speech_vectors.csv') as infile:
    reader = csv.reader(infile, delimiter=',', quotechar='|')
    n = 0
    for row in reader:
        # sample = row.split(',')
        n += 1
        if n%1000 == 0:
            print n
        data_sentiment.append(row[:25000])

pca = PCA(n_components=3)
pca.fit(data_sentiment)
PCA(copy=True, n_components=3, whiten=False)
print(pca.explained_variance_ratio_) 
y = pca.transform(data_sentiment)

The input data is speech_vectors.csv, which is a 2740 × 50000 matrix, available here

Here is the full error traceback:

Traceback (most recent call last):
  File "test.py", line 45, in <module>
    y = pca.transform(data_sentiment)
  File "/Users/jasondou/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 397, in transform
    X = X - self.mean_
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

I do not quite understand what self.mean_ refers to here.

Upvotes: 0

Views: 4534

Answers (1)

ali_m

Reputation: 74252

You are not parsing the CSV file correctly. Each row that your reader returns will be a list of strings, like this:

row = ['0.0', '1.0', '2.0', '3.0', '4.0']

Your data_sentiment will therefore be a list-of-lists-of-strings, for example:

data_sentiment = [row, row, row]

When you pass this directly to pca.fit(), it is internally converted to a numpy array, also containing strings:

import numpy as np

X = np.array(data_sentiment)
print(repr(X))
# array([['0.0', '1.0', '2.0', '3.0', '4.0'],
#        ['0.0', '1.0', '2.0', '3.0', '4.0'],
#        ['0.0', '1.0', '2.0', '3.0', '4.0']], 
#       dtype='|S3')

numpy has no rule for subtracting an array of strings from another array of strings:

X - X
# TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

This mistake would have been very easy to spot if you had bothered to show us some of the contents of data_sentiment in your question, as I asked you to.


What you need to do is convert your strings to floats, for example:

data_sentiment.append([float(s) for s in row[:25000]])
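
As a minimal sketch of that conversion, here it is applied to a small made-up sample. The `rows` list is hypothetical stand-in data, not the contents of your file. Once the values are floats, the same kind of subtraction that failed in the traceback goes through:

```python
import numpy as np

# Hypothetical stand-in for what csv.reader yields: lists of strings.
rows = [['0.0', '1.0', '2.0'], ['3.0', '4.0', '5.0']]

# Convert every field to float while accumulating the rows.
data_sentiment = [[float(s) for s in row] for row in rows]

X = np.array(data_sentiment)

# X has a floating-point dtype, so mean-centering now works.
centered = X - X.mean(axis=0)
```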

A much easier way would be to use np.loadtxt to parse the CSV file:

data_sentiment = np.loadtxt('/path/to/file.csv', delimiter=',')
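
If you only want the first 25000 columns, as `row[:25000]` does in your loop, the `usecols` argument of np.loadtxt should do the same job. A rough sketch on a tiny in-memory sample follows; `csv_text` is made-up data and the column count is shrunk to 3 for illustration:

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for the CSV file.
csv_text = u"0.0,1.0,2.0,3.0\n4.0,5.0,6.0,7.0\n"

# usecols keeps only the first 3 columns, like row[:3] in a manual loop.
data_sentiment = np.loadtxt(io.StringIO(csv_text), delimiter=',',
                            usecols=list(range(3)))
```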

If you have pandas installed, then pandas.read_csv will probably be faster than np.loadtxt for a large array such as this one.
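
A minimal sketch of the pandas route, assuming the file has no header row (I have not checked this against your actual file). Again `csv_text` is made-up sample data and only 3 columns are kept for illustration:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV file.
csv_text = u"0.0,1.0,2.0,3.0\n4.0,5.0,6.0,7.0\n"

# header=None: treat the first line as data, not as column names.
# usecols selects columns by position when there is no header.
df = pd.read_csv(io.StringIO(csv_text), header=None, usecols=list(range(3)))

# Plain float array, suitable for passing to pca.fit().
data_sentiment = df.values
```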

Upvotes: 1
