Reputation: 953
I am trying to build a classifier in which one file is used entirely for training and another file entirely for testing. Is this possible? I tried:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = ',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])
y = df['label']
X_T = vect.fit_transform(df_test['data_test'])
y_t = df_test['label']
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
tf_transformer = TfidfTransformer(use_idf=False).fit(X)
X_train_tf = tf_transformer.transform(X)
X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape
tf_transformer = TfidfTransformer(use_idf=False).fit(X_T)
X_train_tf_teste = tf_transformer.transform(X_T)
X_train_tf_teste.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf_teste = tfidf_transformer.fit_transform(X_T)
X_train_tfidf_teste.shape
#RegLog
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, y_pred, labels = y))
print("F-score")
print(f1_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print("cross validation")
scores = cross_validation.cross_val_score(clf, X, y, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
I set test_size to zero because I do not want to split either file into partitions. I also applied CountVectorizer and TF-IDF to both the training and the test file.
The error I get:
Traceback (most recent call last):
File "classif.py", line 34, in X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
ValueError: too many values to unpack (expected 2)
Upvotes: 0
Views: 1793
Reputation: 2161
First, for the error you get, just write the code as follows and it should work:
X_train, _, y_train, _ = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, _, y_test, _ = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
The function always returns four arrays, in the order X_train, X_test, y_train, y_test, so you need four variables to receive them. Using _ just signals that you don't care about those outputs.
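For reference, here is a tiny self-contained sketch of that return convention, using made-up toy arrays (not the question's data), just to show the four outputs and their order:
import numpy as np
from sklearn.model_selection import train_test_split
# toy data, purely for illustration
X = np.arange(10).reshape(5, 2)
y = np.arange(5)
# the four outputs always come back in this order
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)  # (3, 2) (2, 2) (3,) (2,)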
Second, I don't really know why you are doing this manipulation. If you want to shuffle the data, this is not the best way to do it, and you have already shuffled it earlier anyway.
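If shuffling is really what you want, one common alternative (just a suggestion, not the only way) is pandas' sample:
# shuffle all rows of a DataFrame, instead of the reindex/permutation trick
df = df.sample(frac=1, random_state=100).reset_index(drop=True)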
Upvotes: -1
Reputation: 439
The error you are getting from train_test_split is clearly explained and solved by @Alexis. Once again, I also suggest not using train_test_split here, as it will not do anything except shuffle, which you have already done.
But I want to highlight another important point: if you keep your train and test files separate, do not fit the vectorizer on each one separately. Doing so will create different columns for the train and test files. Example:
cv = CountVectorizer()
train=['Hi this is stack overflow']
cv.fit(train)
cv.get_feature_names()
Output:
['hi', 'is', 'overflow', 'stack', 'this']
test=['Hi that is not stack overflow']
cv.fit(test)
cv.get_feature_names()
Output:
['hi', 'is', 'not', 'overflow', 'stack', 'that']
Hence, fitting them separately results in a column mismatch. So you should either merge the train and test files first and fit_transform the vectorizer on both together, or, if you don't have the test data beforehand, only transform the test data with the vectorizer fitted on the train data, which will ignore words not present in the train data.
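To illustrate that second option, here is a minimal sketch (reusing the toy sentences above) of fitting on the train data only and merely transforming the test data:
from sklearn.feature_extraction.text import CountVectorizer
train = ['Hi this is stack overflow']
test = ['Hi that is not stack overflow']
cv = CountVectorizer()
X_train = cv.fit_transform(train)   # vocabulary is learned from the train data only
X_test = cv.transform(test)         # 'that' and 'not' are ignored, columns stay aligned
print(cv.get_feature_names())       # ['hi', 'is', 'overflow', 'stack', 'this']
print(X_train.shape, X_test.shape)  # (1, 5) (1, 5) -- same number of columns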
Upvotes: 5