Ophilia

Reputation: 717

Text classification with Scikit-learn

I am doing text classification with two labels in scikit-learn. I load my text files with load_files:

from sklearn.datasets import load_files
categories = ['Label0', 'Label1']  # names must match the sub-folder names
text_data = load_files(path, categories=categories)

The files are laid out in the following directory structure:

train
├── Label0
│   ├── 0001.txt
│   └── 0002.txt
└── Label1
    ├── 0001.txt
    └── 0002.txt

My problem is that when I try to look at the shape of text_data.data, I get:

print (type(text_data.data))
<type 'list'>

print text_data.data.shape
AttributeError: 'list' object has no attribute 'shape'

X = np.array(text_data.data)
print X.shape
(35,)

It returns a 1D array. I expected either a 2D NumPy array or a dictionary, with one part holding the text and the other holding the class (label 0 or 1). Have I missed something?

Upvotes: 0

Views: 437

Answers (1)

David Maust

Reputation: 8270

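The problem is that what load_files returns is not yet a NumPy feature matrix. text_data.data is just a list of raw text strings, one per file, which is why converting it gives a 1D array of shape (35,). The labels are kept separately in text_data.target. To get a 2D matrix you need to vectorize the text, for example with CountVectorizer or TfidfVectorizer.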

Example:

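from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

cv = CountVectorizer()
X = cv.fit_transform(text_data.data)  # sparse matrix, shape (n_documents, n_terms)
y = text_data.target                  # integer class label for each document
print(cv.vocabulary_)  # Show words in vocabulary with column index

clf = LogisticRegression()  # or other classifier
clf.fit(X, y)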
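
To see the shapes for yourself, here is a minimal self-contained sketch. It assumes a small made-up corpus written to a temporary directory in the same Label0/Label1 layout; the file contents and the variable names n_docs, target_shape and X_shape are only for illustration and are not from the question.

```python
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus mirroring the train/Label0, train/Label1 layout.
docs = {
    "Label0": ["the cat sat", "the cat ran"],
    "Label1": ["dogs bark loudly", "dogs run"],
}

root = tempfile.mkdtemp()
for label, texts in docs.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts, start=1):
        with open(os.path.join(root, label, "%04d.txt" % i), "w") as f:
            f.write(text)

# Passing an encoding makes load_files decode each file to a text string.
text_data = load_files(root, encoding="utf-8")

# Turn the list of strings into a 2D document-term count matrix.
X = CountVectorizer().fit_transform(text_data.data)

n_docs = len(text_data.data)            # data is a plain list, one string per file
target_shape = text_data.target.shape   # one integer label per file
X_shape = X.shape                       # (n_documents, n_vocabulary_terms)

print(n_docs, target_shape, X_shape, text_data.target_names)
```

The text and the labels are two parallel sequences (text_data.data and text_data.target), and only the vectorized X is 2D.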

Upvotes: 1
