Eka
Eka

Reputation: 15000

How to vectorize and devectorize using sklearn's CountVectorizer?

I want to vectorize some text to corresponding integers and then convert those text to its mapped integers and also create new sentence using new input integers [2,9,39,46,56,12,89,9].

I have seen some custom functions which can used for this purpose but I want to know whether sklearn itself has such functions.

from sklearn.feature_extraction.text import CountVectorizer

a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim. 
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque. 
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus. 
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula 
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo 
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae 
hendrerit vel, vulputate quis metus."""]


vec = CountVectorizer()
dtm=vec.fit_transform(a)
print vec.vocabulary_

#convert text to corresponding vectors
mapped_a=

#new sentence using below mapped values
#input [2,9,39,46,56,12,89,9]
#creating sentence using specific sequence

new_sentence=

Upvotes: 1

Views: 4639

Answers (2)

Jakub Macina
Jakub Macina

Reputation: 982

For vectorizing sentence into integers you can use transform function. Output of this function is vector with counts for each term - feature vector.

vec = CountVectorizer()
vec.fit(a)
print vec.vocabulary_

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print mapped_a.toarray() # sparse feature vector

tokenizer = vec.build_tokenizer()
# array of words ids
for token in tokenizer(new_sentence):
    print vec.vocabulary_.get(token)

The second part of the question is not so straightforward. CountVectorizer has inverse_transform function for this purpose with a sparse vector of features as an input. However, in your example you would like to create a sentence where same terms might occur and with that function it is not possible.

However, the solution is to use vocabulary (word to id) and building inverse vocabulary (id to word) based on it. CountVectorizer by default has no inverse_vocabulary and you must create it based on the vocabulary.

input = [2,9,9]

# 1. inverse_transform function
# create sparse vector
sparse_input = [1 if i in input else 0 for i in range(0, len(vec.vocabulary_))]
print vec.inverse_transform(sparse_input)
> ['aliquam', 'commodo']


# 2. Inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]

for i in input:
    print inverse_vocabulary[i]
> ['aliquam', 'commodo', 'commodo']

Upvotes: 5

DarshanGowda0
DarshanGowda0

Reputation: 462

Take a look at preprocessing libraries in sklearn, LabelEncoder and OneHotEncoder are usually used to encode categorical variables. But encoding the whole text is not recommended!

Upvotes: -1

Related Questions