amin jahani

Reputation: 92

Python KeyError when training a model on a Kaggle text dataset

I'm new to NLP, and I took this project on Kaggle as my first project. Link: https://www.kaggle.com/Cornell-University/arxiv

As I said, I'm new and this is my first encounter with some of these tools. I'm trying to run the code below, but I came across a problem. Here is the code:

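from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# read data (the loading code and the definition of stop_words are not shown)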
data = data[['id', 'title', 'abstract', 'categories']]
categories = list(set([i for l in [x.split(' ') for x in data['categories']] for i in l]))

train, test = train_test_split(data, random_state=42, test_size=0.33, shuffle=True)
X_train = train.abstract
X_test = test.abstract

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])


for category in categories:
    print('... Processing {}'.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

But every time, it stops at:

SVC_pipeline.fit(X_train, train[category])

and raises KeyError: 'astro-ph.HE', which is the first category in categories. I might be making a simple mistake, but I've been stuck here for hours. Thanks.

Upvotes: 1

Views: 299

Answers (1)

Kiona1018

Reputation: 48

If you just want multi-class classification, the code below is enough.

# no loop over categories; use the existing 'categories' column as the target
SVC_pipeline.fit(X_train, train['categories'])

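You can subscript a pandas DataFrame only by column name (id, title, abstract, categories). train[category] instead uses a value from the categories column, such as 'astro-ph.HE', as if it were a column name. No column has that name, so pandas raises a KeyError.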
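To see the difference between a column name and a column value, here is a minimal sketch. The two-row DataFrame is a made-up stand-in for the arXiv metadata, not the real data:

```python
import pandas as pd

# Toy stand-in for the arXiv metadata; the rows are invented for illustration.
data = pd.DataFrame({
    "abstract": ["gamma ray bursts", "graph neural networks"],
    "categories": ["astro-ph.HE", "cs.LG stat.ML"],
})

# 'categories' is a column name, so it can be used as a key.
print("categories" in data.columns)   # True

# 'astro-ph.HE' is only a value inside that column, not a column name.
print("astro-ph.HE" in data.columns)  # False

try:
    data["astro-ph.HE"]
except KeyError as err:
    # Same failure as train[category] in the question.
    print("KeyError:", err)
```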

If you want to filter the DataFrame by value, use the code below. Note that == only matches rows whose categories string is exactly that single category.

train[train['categories'] == category]
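If the goal is really one binary classifier per category, as the loop in the question suggests, each category first needs its own 0/1 column. The following is a rough sketch of one way to do that, assuming the categories column holds space-separated labels. The toy rows are invented, and for brevity it fits and predicts on the same rows and drops OneVsRestClassifier, since each target is already binary:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the arXiv metadata; the rows are invented for illustration.
data = pd.DataFrame({
    "abstract": [
        "gamma ray burst observed by a space telescope",
        "neural network training with gradient descent",
        "neutron star merger and gamma ray emission",
        "machine learning classifies gamma ray sources",
        "gradient descent convergence for neural network",
    ],
    "categories": [
        "astro-ph.HE",
        "cs.LG",
        "astro-ph.HE",
        "astro-ph.HE cs.LG",
        "cs.LG stat.ML",
    ],
})

# Split the space-separated labels into one 0/1 indicator column per category.
labels = data["categories"].str.get_dummies(sep=" ")
categories = list(labels.columns)
data = pd.concat([data, labels], axis=1)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

predictions = {}
for category in categories:
    # data[category] now exists, so this lookup no longer raises KeyError.
    pipeline.fit(data["abstract"], data[category])
    predictions[category] = pipeline.predict(data["abstract"])
    print("... Processed {}".format(category))
```

On the real dataset you would do the train/test split after adding the indicator columns, so that train[category] and test[category] both exist.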

Upvotes: 1
