Reputation: 1117
I have a huge file where I want to treat every line as a document and use CountVectorizer to create the vectors.
What I have tried so far:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(input='file', decode_error='ignore', strip_accents='unicode')
corpus = open('corpus.txt')
vectors = vectorizer.fit_transform([corpus]).toarray()
print vectors
print vectorizer.vocabulary_
The file corpus.txt contains:
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
What I expect is to get an array with three vectors. Instead I get an array with one vector:
[[1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]]
{u'lab': 7, u'eps': 3, u'applications': 1, u'management': 9, u'user': 17, u'human': 5, u'interface': 6, u'response': 12, u'abc': 0, u'for': 4, u'of': 10, u'system': 14, u'machine': 8, u'computer': 2, u'survey': 13, u'time': 16, u'opinion': 11, u'the': 15}
How should I proceed?
Upvotes: 2
Views: 4888
Reputation: 912
Modified code:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')
corpus = open('corpus.txt')
docs = corpus.read().split("\n")
vectors = vectorizer.fit_transform(docs)
print vectors
print vectorizer.vocabulary_
The line docs = corpus.read().split("\n")
reads the whole file and splits it on newlines, so each line becomes a separate document. Note that input='file' is dropped from the constructor, since the vectorizer is now given strings rather than a file object.
Upvotes: 0
Reputation: 3286
Careful: according to the documentation, the input='file'
argument to CountVectorizer means:
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.
The read method reads the entire file into memory as a single string. So with [corpus]
you are passing a one-element list, and the vectorizer sees the whole file as a single document.
Why not do the following instead?
vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')
corpus = open('corpus.txt')
vectors = vectorizer.fit_transform(corpus).toarray()
You can pass the file handle corpus
directly to fit_transform, since it accepts any iterable of strings and iterating over a file yields one line at a time. That also lets you build the vectorizer without reading the entire file into memory.
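For the three-line corpus.txt shown in the question, this produces one row per line. A minimal sketch (an in-memory list stands in for the file handle here, since iterating a file yields lines in exactly the same way):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each element of the iterable becomes one document; a file handle
# iterated line by line behaves identically.
lines = [
    "Human machine interface for lab abc computer applications\n",
    "A survey of user opinion of computer system response time\n",
    "The EPS user interface management system\n",
]
vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')
vectors = vectorizer.fit_transform(lines).toarray()
print(vectors.shape)  # (3, 18): three documents, 18 vocabulary terms
```

The 18 columns match the 18 entries in the vocabulary_ dict printed in the question.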
Upvotes: 3
Reputation: 2831
You are passing it the whole file. If you want to treat each line as a document, loop over the file and pass one line at a time to the CountVectorizer, getting one vector back per line. You can still use a single CountVectorizer object: fit it once to build the vocabulary, then call transform on each line.
Alternatively, you can read the file into a pandas DataFrame and then use apply, but the timing will probably be similar.
Upvotes: 0