Reputation: 317
I am stuck on a problem where I have to add an additional feature (average word length) to the token counts produced by scikit-learn's CountVectorizer. Say I have the following code:
#list of tweets
texts = [(list of tweets)]
#list of average word length of every tweet
average_lengths = word_length(texts)
#tokenizer
count_vect = CountVectorizer(analyzer = 'word', ngram_range = (1,1))
x_counts = count_vect.fit_transform(texts)
The format should be (tokens, average word length) for every instance. My initial idea was to simply combine the two using the zip function like this:
x = zip(x_counts, average_lengths)
but then I get an error when I try to fit my model:
ValueError: setting an array element with a sequence.
Anyone have any idea how to solve this problem?
Upvotes: 6
Views: 3818
Reputation: 1373
You can write your own transformer, like in this article, that gives you the average word length of every tweet, and combine it with CountVectorizer using FeatureUnion:
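from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# AverageLenVectizer is the custom transformer you write yourself
vectorizer = FeatureUnion([
    ('cv', CountVectorizer(analyzer='word', ngram_range=(1, 1))),
    ('av_len', AverageLenVectizer(...))
])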
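For illustration, here is a minimal sketch of what such a transformer might look like. The class name AverageWordLength, the whitespace-based word splitting, and the two sample tweets are all assumptions made for this example, not something taken from the linked article:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion


class AverageWordLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column with the mean word length of each text."""

    def fit(self, X, y=None):
        # Nothing to learn; fit exists so the class works inside FeatureUnion.
        return self

    def transform(self, X):
        rows = []
        for text in X:
            words = text.split()  # assumption: words are whitespace-separated
            mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
            rows.append([mean_len])
        return np.array(rows)


texts = ["short tweet", "a somewhat longer example tweet"]  # placeholder data

vectorizer = FeatureUnion([
    ("cv", CountVectorizer(analyzer="word", ngram_range=(1, 1))),
    ("av_len", AverageWordLength()),
])

# One row per tweet: the token counts followed by the average word length.
x = vectorizer.fit_transform(texts)
```

Since the CountVectorizer part is sparse, the combined x should come back as a sparse matrix too, which most scikit-learn estimators accept directly.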
Upvotes: 5
Reputation: 1488
Because CountVectorizer returns a sparse matrix, you need to perform sparse matrix operations on it. You can do so by using hstack from scipy.sparse.
For example (taken from scipy's documentation):
from scipy.sparse import coo_matrix, hstack
A = coo_matrix([[1, 2], [3, 4]])
B = coo_matrix([[5], [6]])
hstack([A,B]).toarray()
array([[1, 2, 5],
[3, 4, 6]])
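Applied to the question's setup, this could look roughly like the sketch below. The sample tweets and the way average_lengths is computed are stand-ins I made up, since the asker's word_length helper is not shown:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["short tweet", "a somewhat longer example tweet"]  # placeholder data
# Stand-in for the question's word_length helper, which is not shown.
average_lengths = [np.mean([len(w) for w in t.split()]) for t in texts]

count_vect = CountVectorizer(analyzer="word", ngram_range=(1, 1))
x_counts = count_vect.fit_transform(texts)

# Turn the lengths into a single sparse column with one row per tweet.
length_column = csr_matrix(np.array(average_lengths).reshape(-1, 1))

# Append the extra column to the right of the token counts.
x = hstack([x_counts, length_column]).tocsr()
```

The key point is that the extra feature has to be a column with the same number of rows as x_counts; zip produces pairs of (sparse row, float) instead, which is presumably what triggers the ValueError when fitting.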
Upvotes: 2