Aris F.

Reputation: 1117

Use CountVectorizer with file where every line is a document

I have a huge file where I want to treat every line as a document and use CountVectorizer to create the vectors.

What I have tried so far:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='file', decode_error='ignore', strip_accents='unicode')
corpus = open('corpus.txt')
vectors = vectorizer.fit_transform([corpus]).toarray()
print(vectors)
print(vectorizer.vocabulary_)

The file corpus.txt

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system

What I expect is to get an array with three vectors. Instead I get an array with one vector:

[[1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]]
{'lab': 7, 'eps': 3, 'applications': 1, 'management': 9, 'user': 17, 'human': 5, 'interface': 6, 'response': 12, 'abc': 0, 'for': 4, 'of': 10, 'system': 14, 'machine': 8, 'computer': 2, 'survey': 13, 'time': 16, 'opinion': 11, 'the': 15}

How should I proceed?

Upvotes: 2

Views: 4888

Answers (3)

Danish

Reputation: 912

Modified code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')
with open('corpus.txt') as corpus:
    docs = corpus.read().split("\n")
vectors = vectorizer.fit_transform(docs)
print(vectors)
print(vectorizer.vocabulary_)

The line docs = corpus.read().split("\n") splits the file's text at each newline, so every line becomes its own document. Note that input='file' has to be dropped here: once the items are plain strings rather than file-like objects, the default input='content' is what you want.

Upvotes: 0

Ryan Walker

Reputation: 3286

Careful: per the documentation, the input='file' argument to CountVectorizer means:

If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.

The read method, called on a file, reads the entire text into memory as a single string. So with [corpus] you are handing the vectorizer a single document that contains the whole file.
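A minimal stand-in for this (using io.StringIO in place of the real corpus.txt so it runs on its own) shows why [corpus] produces exactly one document:

```python
import io

# A file-like object stands in for open('corpus.txt').
corpus = io.StringIO("line one\nline two\nline three\n")

# [corpus] is a sequence with ONE item; CountVectorizer(input='file')
# calls .read() on each item, which returns the whole text as a single string.
docs = [f.read() for f in [corpus]]
print(len(docs))  # 1
```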

Why not do the following instead?

vectorizer = CountVectorizer(decode_error='ignore',strip_accents='unicode')
corpus = open('corpus.txt')
vectors = vectorizer.fit_transform(corpus).toarray()

You can pass the file handle corpus directly to fit_transform, since it accepts any iterable of documents and iterating a file yields one line at a time. That lets you build the vectorizer without reading the entire file into memory.
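Since iterating a file handle yields its lines, the call above is equivalent to passing a list of strings. As a sanity check (with the three lines rebuilt from the question's corpus.txt so this runs standalone):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three lines of the question's corpus.txt, as a file iterator would yield them.
lines = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
]

vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')
vectors = vectorizer.fit_transform(lines).toarray()
print(vectors.shape)  # (3, 18): one row per document, 18 vocabulary terms
```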

Upvotes: 3

simon

Reputation: 2831

You are passing it the whole file. If you want to go line by line, you need a loop that passes one row at a time into the CountVectorizer and returns one vector. You can still use just the one CountVectorizer object: fit it once, then call transform for each line.

Alternatively, you can read the file into a pandas DataFrame and use apply, but the timing will probably be similar.
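A minimal sketch of that line-by-line loop (it recreates the question's corpus.txt so it runs standalone; the vocabulary is fitted once up front so every per-line vector has the same columns):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Recreate the question's corpus so the sketch is self-contained.
with open('corpus.txt', 'w') as f:
    f.write("Human machine interface for lab abc computer applications\n"
            "A survey of user opinion of computer system response time\n"
            "The EPS user interface management system\n")

vectorizer = CountVectorizer(decode_error='ignore', strip_accents='unicode')

# First pass: learn the vocabulary from the whole file.
with open('corpus.txt') as f:
    vectorizer.fit(f)

# Second pass: transform one line at a time into a single-row vector.
rows = []
with open('corpus.txt') as f:
    for line in f:
        rows.append(vectorizer.transform([line]))  # 1 x vocab_size sparse row

print(len(rows), rows[0].shape)
```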

Upvotes: 0
