Reputation: 717
I am doing text classification for two labels with scikit-learn. I am loading my text files with the load_files function:
from sklearn.datasets import load_files
categories = ['label0', 'label1']
text_data = load_files(path, categories=categories)
from the following structure:
train
├── Label0
│   ├── 0001.txt
│   └── 0002.txt
└── Label1
    ├── 0001.txt
    └── 0002.txt
My problem is that when I try to look at the shape of text_data.data, I get:
print (type(text_data.data))
<type 'list'>
print text_data.data.shape
AttributeError: 'list' object has no attribute 'shape'
X = np.array(text_data.data)
print X.shape
(35,)
It returns a 1D array. I expected a 2D NumPy array, or a dictionary, where one part holds the text and the other holds the class (label0 or label1). Have I missed something?
Upvotes: 0
Views: 437
Reputation: 8270
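The problem is that after calling load_files, the data is not yet a NumPy array. text_data.data is just a list of text strings, one per file, which is why converting it gives a 1D array of shape (35,). The class labels are stored separately in text_data.target. To get a 2D feature matrix, you should vectorize the text using CountVectorizer or TfidfVectorizer.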
Example:
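from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

cv = CountVectorizer()
X = cv.fit_transform(text_data.data)  # 2D sparse matrix: one row per document, one column per word
y = text_data.target  # 1D array with the class index of each document
print(cv.vocabulary_)  # Show words in vocabulary with column index
clf = LogisticRegression()  # or other classifier
clf.fit(X, y)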
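To see the shapes for yourself, here is a minimal self-contained sketch. The toy corpus, the temporary directory and the variable names are made up for illustration and are not from the question. It uses TfidfVectorizer instead of CountVectorizer, and passes encoding="utf-8" so that load_files returns decoded strings rather than raw bytes.

```python
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, written to a temporary directory for illustration only.
docs = {
    "Label0": ["the cat sat on the mat", "a cat and a dog"],
    "Label1": ["stocks fell sharply today", "markets and stocks rallied"],
}
root = tempfile.mkdtemp()
for label, texts in docs.items():
    os.makedirs(os.path.join(root, label))
    for i, text in enumerate(texts, start=1):
        with open(os.path.join(root, label, "%04d.txt" % i), "w") as fh:
            fh.write(text)

# Category names must match the folder names exactly.
text_data = load_files(root, categories=["Label0", "Label1"], encoding="utf-8")

n_docs = len(text_data.data)            # data is a plain list: one string per file
target_shape = text_data.target.shape   # target is a 1D NumPy array of class indices
target_names = text_data.target_names   # class index -> folder name

# Vectorizing is what produces the 2D (documents x vocabulary terms) matrix.
X = TfidfVectorizer().fit_transform(text_data.data)
print(n_docs, target_shape, target_names, X.shape)
```

So the text and the labels are both there, just as two parallel attributes (data and target) rather than as two columns of one array. The second dimension only appears once a vectorizer has turned each document into a row of features.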
Upvotes: 1