Python traceback: KeyError when training a model on a Kaggle text dataset

Question

I'm new to NLP, and I took this Kaggle project as my first project. Link: https://www.kaggle.com/Cornell-University/arxiv

As I said, I'm new and this is my first encounter with some of these tools. I'm trying to run the code below, but I came across a problem. Here is the code:

# read data

data = data[['id', 'title', 'abstract', 'categories']]
categories = list(set([i for l in [x.split(' ') for x in data['categories']] for i in l]))

train, test = train_test_split(data, random_state=42, test_size=0.33, shuffle=True)
X_train = train.abstract
X_test = test.abstract

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])


for category in categories:
    print('... Processing {}'.format(category))
    # train the model on the training abstracts and this category's labels
    SVC_pipeline.fit(X_train, train[category])
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

But every time it stops at:

SVC_pipeline.fit(X_train, train[category])

and raises KeyError: 'astro-ph.HE', which is the first category in categories. I might be making a simple mistake, but I've been stuck here for hours. Thanks.
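
A likely cause, judging only from the code shown: the DataFrame has the columns id, title, abstract and categories, and 'astro-ph.HE' is a value inside the categories strings rather than a column name, so train[category] has no column to look up. A minimal sketch of one possible fix follows, assuming a tiny made-up DataFrame in place of the real arXiv data (the rows are hypothetical). It adds one 0/1 indicator column per category, which would need to happen before train_test_split so that train and test both carry those columns.

```python
import pandas as pd

# Toy stand-in for the arXiv metadata (hypothetical rows, not real data).
data = pd.DataFrame({
    "abstract": [
        "gamma ray bursts",
        "graph neural networks",
        "pulsar timing with machine learning",
    ],
    "categories": ["astro-ph.HE", "cs.LG stat.ML", "astro-ph.HE cs.LG"],
})

# Same category list as in the question, sorted here for a stable order.
categories = sorted({c for cats in data["categories"] for c in cats.split(" ")})

# 'astro-ph.HE' is a value inside the 'categories' column, not a column name,
# so data['astro-ph.HE'] would raise KeyError at this point.
assert "astro-ph.HE" not in data.columns

# Add one 0/1 column per category: 1 if the paper carries that label.
for category in categories:
    data[category] = data["categories"].apply(
        lambda cats, c=category: int(c in cats.split(" "))
    )

print(data[["categories"] + categories])
```

With columns like these in place, train[category] and test[category] in the loop resolve to a binary label vector for each category. On the full dataset, a rare category could end up with only one class in the training split, which may need separate handling.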

