Reputation: 92
I'm new to NLP, and I picked this Kaggle dataset as my first project. Link: https://www.kaggle.com/Cornell-University/arxiv
As I said, I'm new and this is my first time using some of these tools. I'm trying to run the code below, but I ran into a problem. Here is the code:
# read data
data = data[['id', 'title', 'abstract', 'categories']]
categories = list(set([i for l in [x.split(' ') for x in data['categories']] for i in l]))
train, test = train_test_split(data, random_state=42, test_size=0.33, shuffle=True)
X_train = train.abstract
X_test = test.abstract
SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
But every time it stops at:
SVC_pipeline.fit(X_train, train[category])
and raises KeyError: 'astro-ph.HE', which is the first category in categories. I might be making a simple mistake, but I've been stuck here for hours. Thanks.
Upvotes: 1
Views: 299
Reputation: 48
If you just want multi-class classification, the code below is enough.
# no loop with categories
SVC_pipeline.fit(X_train, train.categories)
With train[...], you can select from a pandas DataFrame only by column name (id, title, abstract, categories). You are passing a value from the categories column instead, so a KeyError is raised.
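If you do want to keep the per-category loop, each category first has to exist as a column of its own. Here is a minimal sketch of one way to do that, assuming the categories strings are space-separated as in the question's split(' ') call. The toy frame is made-up stand-in data, not the real arXiv metadata.

```python
import pandas as pd

# Hypothetical toy data with the same 'categories' layout as the question.
data = pd.DataFrame({
    "abstract": ["gamma ray bursts", "quark masses", "jets and bursts"],
    "categories": ["astro-ph.HE", "hep-ph", "astro-ph.HE hep-ph"],
})

# 'astro-ph.HE' is a value inside the 'categories' column, not a column label,
# so data['astro-ph.HE'] would raise KeyError at this point.
has_column_before = "astro-ph.HE" in data.columns

# Build one 0/1 indicator column per category, splitting on the space separator.
indicators = data["categories"].str.get_dummies(sep=" ")
data = data.join(indicators)

# Each category is now a real column and can serve as a binary target.
y = data["astro-ph.HE"]
print(y.tolist())
```

If the indicator columns are joined before train_test_split, train[category] inside the loop should resolve to a 0/1 Series instead of raising.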
If you want to filter the DataFrame by value, use the code below.
train[train['categories'] == category]
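One caveat with an equality filter: it only matches rows whose whole categories string equals the category. Since the question splits that column on spaces, some rows probably list several categories. A rough sketch comparing the two filters, again on made-up stand-in data:

```python
import pandas as pd

# Hypothetical toy data; the last row belongs to two categories.
train = pd.DataFrame({
    "abstract": ["gamma ray bursts", "quark masses", "jets and bursts"],
    "categories": ["astro-ph.HE", "hep-ph", "astro-ph.HE hep-ph"],
})
category = "astro-ph.HE"

# Exact equality only matches rows whose entire string is the category.
exact = train[train["categories"] == category]

# A membership test on the split list also matches multi-category rows.
mask = train["categories"].str.split().apply(lambda cats: category in cats)
containing = train[mask]

print(len(exact), len(containing))
```

Which of the two you want depends on whether a paper with several categories should count as belonging to each of them.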
Upvotes: 1