Reputation: 1445
I have the following dataset:
input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1]], columns=('feature1', 'Tag'))
I want to turn it into a TF-IDF matrix using the following function:
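import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def TfifdMatrix(inputSet):
    # Count term occurrences; fit_transform already returns the count matrix
    vectorizer = CountVectorizer()
    smatrix = vectorizer.fit_transform(inputSet)
    print("fit transform done")
    # Re-weight the counts with TF-IDF; the matrix stays sparse here because
    # recent scikit-learn versions reject the np.matrix returned by todense()
    tfidf = TfidfTransformer(norm="l2")
    tf_idf_matrix = tfidf.fit_transform(smatrix)
    print("transformation done")
    # Convert to a dense DataFrame with one column per vocabulary term
    TfidfMatrix = pd.DataFrame(tf_idf_matrix.toarray())
    return TfidfMatrix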
Now I transform the data and append the TF-IDF columns to the original frame, which already holds the tag:
input_data2 = TfifdMatrix(input_data['feature1'])
input_data = pd.concat([input_data, input_data2], axis=1)
Then I create a training set and a test set:
train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]
train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)
test_features2 = test['Tag']
Now I train a decision tree algorithm on it:
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))
Now I combine everything to get an overview of the original features, the real outcome, the predicted outcome and the TF-IDF matrix:
df_final = pd.DataFrame(test_features, test_outcome)
df_final['Prediction'] = my_dt_prediction
This, however, gives me the following data:
   feature1    0    1    2    3    4  Prediction
1       NaN  NaN  NaN  NaN  NaN  NaN           1
Any thoughts on where this goes wrong?
Upvotes: 0
Views: 1452
Reputation: 10429
Considering that you are already using sklearn, I would use train_test_split to do the dataset splitting.
from sklearn.model_selection import train_test_split
from sklearn import tree
import pandas as pd
# Assumes input_data already holds the TF-IDF columns next to 'feature1' and 'Tag'
Y = input_data['Tag']
X = input_data.drop('Tag', axis=1)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=123)
# Train and predict on the numeric TF-IDF columns only; the tree cannot use the raw text column
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(Xtrain.drop('feature1', axis=1), Ytrain)
my_dt_prediction = my_tree_one.predict(Xtest.drop('feature1', axis=1))
# Join it all: the split keeps the original row index, so concat lines the rows up
complete_df = pd.concat([Xtest, Ytest], axis=1)  # features and actual
complete_df['Predicted'] = my_dt_prediction  # complete_df now has features, actual, and predicted
You could also skip the intermediate variable and fill the predictions column in one line:
complete_df['Predicted'] = my_tree_one.predict(Xtest.drop('feature1', axis=1))
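For reference, here is a minimal end-to-end sketch of the train_test_split approach on the toy data from the question. Two things in it are my own assumptions rather than part of the original code: I use TfidfVectorizer as a one-step stand-in for CountVectorizer plus TfidfTransformer, and I prefix the TF-IDF columns with a made-up "tfidf_" label so all column names are strings. Treat it as an illustration, not the only way to do it.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data from the question
input_data = pd.DataFrame(
    [['This is the news', 0], ['This is the news', 0],
     ['This is not the news', 1], ['This is not the news', 1],
     ['This is not the news', 1], ['This is not the news', 1]],
    columns=('feature1', 'Tag'))

# Assumption: TfidfVectorizer replaces CountVectorizer + TfidfTransformer
vectorizer = TfidfVectorizer(norm='l2')
tfidf = pd.DataFrame(
    vectorizer.fit_transform(input_data['feature1']).toarray(),
    index=input_data.index,      # keep the rows aligned with input_data
).add_prefix('tfidf_')           # hypothetical prefix, gives string column names
tfidf_cols = list(tfidf.columns)

# Keep the raw text next to the numeric features so it survives the split
X = pd.concat([input_data[['feature1']], tfidf], axis=1)
Y = input_data['Tag']
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, random_state=123)

# Fit and predict on the numeric TF-IDF columns only
clf = DecisionTreeClassifier(random_state=0)
clf.fit(Xtrain[tfidf_cols], Ytrain)

# The split preserves the original index, so concat aligns row by row
complete_df = pd.concat([Xtest, Ytest], axis=1)
complete_df['Predicted'] = clf.predict(Xtest[tfidf_cols])
print(complete_df)
```

Because every piece carries the same row index from input_data, nothing has to be re-joined by hand and no NaN rows appear.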
--UPDATE--
So in my comment, I was mentioning using a "key" column, but the solution is actually simpler than that.
Assuming your input_data contains the original word features and the target variable, just apply the TF-IDF algorithm to your input_data and add the TF-IDF transformed matrix to the input_data.
input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1]], columns=('feature1', 'Tag'))
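import pandas as pd
from sklearn import tree
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def TfifdMatrix(inputSet):
    # Count term occurrences; fit_transform already returns the count matrix
    vectorizer = CountVectorizer()
    smatrix = vectorizer.fit_transform(inputSet)
    print("fit transform done")
    # Re-weight the counts with TF-IDF; the matrix stays sparse here because
    # recent scikit-learn versions reject the np.matrix returned by todense()
    tfidf = TfidfTransformer(norm="l2")
    tf_idf_matrix = tfidf.fit_transform(smatrix)
    print("transformation done")
    # Convert to a dense DataFrame with one column per vocabulary term
    TfidfMatrix = pd.DataFrame(tf_idf_matrix.toarray())
    return TfidfMatrix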
input_data2 = TfifdMatrix(input_data['feature1'])
# Add your TF-IDF transformation matrix
input_data = pd.concat([input_data, input_data2], axis=1)
# Now do your usual train/test split
train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]
train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)
# Now train, but make sure to drop your original word feature for both fit and predict
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))
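# Now combine: start from the test features so their index and values are kept
# (pd.DataFrame(test_features, test_outcome) would treat the outcomes as index labels and yield NaN rows)
df_final = test_features.copy()
df_final['Tag'] = test_outcome
df_final['Prediction'] = my_dt_prediction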
You should get a dataframe with your original word features, the TF-IDF transformed features, your actual values, and your predicted values.
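On the NaN row in the question: as far as I can tell, it comes from how pd.DataFrame treats its second positional argument, which is index. When the first argument is already a DataFrame, pandas reindexes it to those labels instead of attaching them as a column. Below is a small sketch with made-up values (the row label 4 and the 0.5 weight are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical single test row, labelled 4 as it might be after sampling
test_features = pd.DataFrame(
    {'feature1': ['This is not the news'], 0: [0.5]}, index=[4])
test_outcome = [1]

# The outcomes are taken as index labels: the frame is reindexed to label 1,
# which does not exist in test_features, so every cell becomes NaN
broken = pd.DataFrame(test_features, test_outcome)
print(broken)

# Adding the outcome as a column keeps the original index and values
fixed = test_features.assign(Tag=test_outcome)
print(fixed)
```

The Prediction column in the question looked fine only because it was assigned afterwards, on top of the already emptied frame.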
Upvotes: 1