Tim

Reputation: 317

Add additional feature to CountVectorizer matrix

I need to add an additional feature (average word length) to the matrix of token counts produced by scikit-learn's CountVectorizer. Say I have the following code:

from sklearn.feature_extraction.text import CountVectorizer

# list of tweets
texts = [(list of tweets)]

# list of average word length of every tweet
average_lengths = word_length(texts)

# tokenizer
count_vect = CountVectorizer(analyzer='word', ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)

The format should be (tokens, average word length) for every instance. My initial idea was to simply combine the token counts and the list of lengths with zip, like this:

x = zip(x_counts, average_lengths)

but then I get an error when I try to fit my model:

ValueError: setting an array element with a sequence.   

Anyone have any idea how to solve this problem?

Upvotes: 6

Views: 3818

Answers (2)

Andrei

Reputation: 1373

You can write your own transformer, like in this article, that gives you the average word length of every tweet, and combine it with CountVectorizer using FeatureUnion:

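from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# AverageLenVectizer is the custom transformer you write yourself
vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectizer(...))
])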
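
For illustration, here is a minimal sketch of what such a transformer might look like. The class name AverageWordLength, the whitespace-based word splitting, and the sample texts are assumptions made for this example, not taken from the linked article:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class AverageWordLength(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: one column with the mean word length per document."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists only to satisfy the transformer interface.
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            words = doc.split()  # assumption: words are separated by whitespace
            mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
            rows.append([mean_len])
        # Shape (n_samples, 1), so it can be stacked next to the token counts.
        return np.array(rows)


texts = ["the cat sat", "hello world"]  # made-up sample tweets

vectorizer = FeatureUnion([
    ("cv", CountVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("av_len", AverageWordLength()),
])

# Token-count columns first, followed by the average-length column.
x = vectorizer.fit_transform(texts)
```

Because the feature is computed inside the union, the same vectorizer can later be applied to unseen tweets with vectorizer.transform(...) without recomputing the lengths by hand.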

Upvotes: 5

Bunny_Ross

Reputation: 1488

Because CountVectorizer returns a sparse matrix, you need to use sparse matrix operations to add a column to it. You can do so with hstack from scipy.sparse.

For example (taken from scipy's documentation):

>>> from scipy.sparse import coo_matrix, hstack
>>> A = coo_matrix([[1, 2], [3, 4]])
>>> B = coo_matrix([[5], [6]])
>>> hstack([A, B]).toarray()
array([[1, 2, 5],
       [3, 4, 6]])
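
Applied to the question, this could look roughly like the sketch below. The sample texts and the hard-coded average_lengths values are made up for the example; in the real code they would come from the asker's own data and word_length function:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat", "hello world"]  # made-up sample tweets
average_lengths = [3.0, 5.0]            # assumed precomputed, one value per tweet

count_vect = CountVectorizer(analyzer="word", ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)

# Turn the 1-D list into an (n_samples, 1) sparse column so the row counts match.
length_column = csr_matrix(np.array(average_lengths).reshape(-1, 1))

# Append the column to the right of the token counts; CSR is convenient for estimators.
x = hstack([x_counts, length_column]).tocsr()
```

The result stays sparse, so it should be usable with estimators that accept sparse input without ever building the full dense matrix.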

Upvotes: 2
