Reputation: 953
I am trying to build a classifier in which one file is used entirely for training and another file entirely for testing. Is this possible? I tried:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
#csv file from train
df = pd.read_csv('data_train.csv', sep = ',')
#csv file from test
df_test = pd.read_csv('data_test.csv', sep = ',')
#Randomising the rows in the file
df = df.reindex(np.random.permutation(df.index))
df_test = df_test.reindex(np.random.permutation(df_test.index))
vect = CountVectorizer()
X = vect.fit_transform(df['data_train'])
y = df['label']
X_T = vect.fit_transform(df_test['data_test'])
y_t = df_test['label']
X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, y_test = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
tf_transformer = TfidfTransformer(use_idf=False).fit(X)
X_train_tf = tf_transformer.transform(X)
X_train_tf.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X)
X_train_tfidf.shape
tf_transformer = TfidfTransformer(use_idf=False).fit(X_T)
X_train_tf_teste = tf_transformer.transform(X_T)
X_train_tf_teste.shape
tfidf_transformer = TfidfTransformer()
X_train_tfidf_teste = tfidf_transformer.fit_transform(X_T)
X_train_tfidf_teste.shape
#RegLog
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, y_pred, labels = y))
print("F-score")
print(f1_score(y_test, y_pred, average=None))
print(precision_score(y_test, y_pred, average=None))
print(recall_score(y_test, y_pred, average=None))
print("cross validation")
scores = cross_validation.cross_val_score(clf, X, y, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
I set test_size to zero because I do not want to split either file into partitions. I also applied CountVectorizer and TF-IDF to both the training and the test file.
The error I get:
Traceback (most recent call last):
File "classif.py", line 34, in X_train, y_train = train_test_split(X, y, test_size = 0, random_state = 100)
ValueError: too many values to unpack (expected 2)
Upvotes: 0
Views: 1793
Reputation: 2161
First, for the error you get, just write the code as follows and it should work:
X_train, _, y_train, _ = train_test_split(X, y, test_size = 0, random_state = 100)
X_test, _, y_test, _ = train_test_split(X_T, y_t, test_size = 0, random_state = 100)
The function always returns four arrays, in the order X_train, X_test, y_train, y_test, so you need four variables to receive them. Using _ just signals that you don't care about those outputs.
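For reference, here is a tiny self-contained sketch of that return convention, using made-up toy arrays (not the question's data), just to show the four outputs and their order:
import numpy as np
from sklearn.model_selection import train_test_split
# toy data, purely for illustration
X = np.arange(10).reshape(5, 2)
y = np.arange(5)
# the four outputs always come back in this order
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)  # (3, 2) (2, 2) (3,) (2,)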
Second, I don't really know why you are doing this manipulation. If you want to shuffle the data, this is not the best way to do it, and you have already shuffled it earlier anyway.
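If shuffling is really what you want, one common alternative (just a suggestion, not the only way) is pandas' sample:
# shuffle all rows of a DataFrame, instead of the reindex/permutation trick
df = df.sample(frac=1, random_state=100).reset_index(drop=True)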
Upvotes: -1
Reputation: 439
The error you are getting from train_test_split is clearly explained and solved by @Alexis. Once again, I also suggest not using train_test_split here, as it will not do anything except shuffle, which you have already done.
But I want to highlight another important point: if you keep your train and test files separate, do not fit the vectorizer on each one separately. Doing so will create different columns for the train and test files. Example:
cv = CountVectorizer()
train=['Hi this is stack overflow']
cv.fit(train)
cv.get_feature_names()
Output:
['hi', 'is', 'overflow', 'stack', 'this']
test=['Hi that is not stack overflow']
cv.fit(test)
cv.get_feature_names()
Output:
['hi', 'is', 'not', 'overflow', 'stack', 'that']
Hence, fitting them separately results in a column mismatch. So you should either merge the train and test files first and fit_transform the vectorizer on both together, or, if you don't have the test data beforehand, only transform the test data with the vectorizer fitted on the train data, which will ignore words not present in the train data.
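To illustrate that second option, here is a minimal sketch (reusing the toy sentences above) of fitting on the train data only and merely transforming the test data:
from sklearn.feature_extraction.text import CountVectorizer
train = ['Hi this is stack overflow']
test = ['Hi that is not stack overflow']
cv = CountVectorizer()
X_train = cv.fit_transform(train)   # vocabulary is learned from the train data only
X_test = cv.transform(test)         # 'that' and 'not' are ignored, columns stay aligned
print(cv.get_feature_names())       # ['hi', 'is', 'overflow', 'stack', 'this']
print(X_train.shape, X_test.shape)  # (1, 5) (1, 5) -- same number of columns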
Upvotes: 5