Henk Straten

Reputation: 1445

Create a data frame with predicted values, real values and original features

I have the following dataset:

input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1], ['This is not the news', 1]], columns=('feature1', 'Tag'))

I want to turn it into a TF-IDF matrix using the following function:

def TfifdMatrix(inputSet):
    vectorizer = CountVectorizer()
    vectorizer.fit_transform(inputSet)
    print("fit transform done")
    smatrix = vectorizer.transform(inputSet)
    print("transform done")
    smatrix = smatrix.todense()
    tfidf = TfidfTransformer(norm="l2")
    tfidf.fit(smatrix)
    tf_idf_matrix = tfidf.transform(smatrix)
    print("transformation done")
    TfidfMatrix = pd.DataFrame(tf_idf_matrix.todense())
    return (TfidfMatrix)

Now I transform the data and join the result to the original features and tag:

input_data2 = TfifdMatrix(input_data['feature1'])
input_data = pd.concat([input_data, input_data2], axis=1)

Then I create a training and a test set:

train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]

train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)

test_features2 = test['Tag']

Now I train a decision tree algorithm on it:

my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))

Now I combine everything to get an overview of the original features, the real outcome, the predicted outcome and the TF-IDF matrix:

df_final = pd.DataFrame(test_features, test_outcome)
df_final['Prediction'] = my_dt_prediction

This, however, gives me the following data:

  feature1   0   1   2   3   4  Prediction
  1      NaN NaN NaN NaN NaN NaN           1

Any thoughts on where this goes wrong?

Upvotes: 0

Views: 1452

Answers (1)

Scratch'N'Purr

Reputation: 10429

Considering that you are using sklearn already, I would have used train_test_split to do the dataset splitting.

from sklearn.model_selection import train_test_split
from sklearn import tree
import pandas as pd

Y = input_data['Tag']
X = input_data.drop('Tag', axis=1)

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=123)

# Train and predict (X is assumed to hold only numeric feature columns here)
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(Xtrain, Ytrain)
my_dt_prediction = my_tree_one.predict(Xtest)

# Join it all: Xtest and Ytest share the same index, so concat aligns them row by row
complete_df = pd.concat([Xtest, Ytest], axis=1)  # features and actual
complete_df['Predicted'] = my_dt_prediction  # now you have features, actual, and predicted

You could also skip the intermediate variable and write the predictions straight into the new column:

complete_df['Predicted'] = my_tree_one.predict(Xtest)
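For a self-contained run of the approach above on the question's toy data, here is a minimal sketch. It assumes the TF-IDF columns are built with `TfidfVectorizer` (a one-step substitute for the question's `CountVectorizer` plus `TfidfTransformer` pair, not the asker's own function), and the `tfidf_` column prefix is made up for illustration. The text column is dropped before fitting and predicting because the tree cannot consume raw strings.

```python
import pandas as pd
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

input_data = pd.DataFrame(
    [['This is the news', 0], ['This is the news', 0],
     ['This is not the news', 1], ['This is not the news', 1],
     ['This is not the news', 1], ['This is not the news', 1]],
    columns=('feature1', 'Tag'))

# Assumption: TfidfVectorizer stands in for the question's two-step function.
vectorizer = TfidfVectorizer(norm="l2")
tfidf = pd.DataFrame(
    vectorizer.fit_transform(input_data['feature1']).toarray(),
    index=input_data.index,  # keep the TF-IDF rows aligned with the source rows
).add_prefix('tfidf_')  # hypothetical prefix so every column name is a string

X = pd.concat([input_data[['feature1']], tfidf], axis=1)
Y = input_data['Tag']

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=123)

# Fit and predict on the numeric TF-IDF columns only.
clf = tree.DecisionTreeClassifier()
clf.fit(Xtrain.drop('feature1', axis=1), Ytrain)

# Xtest and Ytest carry the same index, so concat lines them up row by row.
complete_df = pd.concat([Xtest, Ytest], axis=1)
complete_df['Predicted'] = clf.predict(Xtest.drop('feature1', axis=1))
print(complete_df)
```

Because the original text column stays in `Xtest`, the final frame shows the raw sentence next to its TF-IDF values, the actual tag, and the prediction.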

--UPDATE--

So in my comment, I was mentioning using a "key" column, but the solution is actually simpler than that.

Assuming your input_data contains the original word features and the target variable, just apply the TF-IDF transformation to your input_data and add the resulting matrix to input_data.

input_data = pd.DataFrame([['This is the news', 0], ['This is the news', 0], ['This is not the news', 1]], columns=('feature1', 'Tag'))

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def TfifdMatrix(inputSet):
    # Count the tokens, then re-weight the counts with TF-IDF
    vectorizer = CountVectorizer()
    smatrix = vectorizer.fit_transform(inputSet)
    print("fit transform done")
    tfidf = TfidfTransformer(norm="l2")
    tf_idf_matrix = tfidf.fit_transform(smatrix)
    print("transformation done")
    # toarray() returns a plain ndarray; the row order matches inputSet
    return pd.DataFrame(tf_idf_matrix.toarray())

input_data2 = TfifdMatrix(input_data['feature1'])

# Add your TF-IDF transformation matrix
input_data = pd.concat([input_data, input_data2], axis=1)

# Now do your usual train/test split
train = input_data.sample(frac=0.8, random_state=1)
test = input_data.loc[~input_data.index.isin(train.index)]
train_outcome = train['Tag'].values
train_features = train.drop('Tag', axis=1)
test_outcome = test['Tag'].values
test_features = test.drop('Tag', axis=1)

# Now train, but make sure to drop your original word feature for both fit and predict
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(train_features.drop('feature1', axis=1), train_outcome)
my_dt_prediction = my_tree_one.predict(test_features.drop('feature1', axis=1))

# Now combine: attach the actual and predicted values as columns
# (a second positional argument to pd.DataFrame would be used as the index instead)
df_final = test_features.copy()
df_final['Tag'] = test_outcome
df_final['Prediction'] = my_dt_prediction

You should get a dataframe with your original word features, the TF-IDF transformed features, your actual values, and your predicted values.
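On why the question's frame came back full of NaN: when `pd.DataFrame` receives an existing DataFrame as its data, the second positional argument is `index`, and pandas reindexes the data to those labels rather than attaching them as a column. A small sketch with made-up values (the column name `a` and the numbers are illustrative only):

```python
import pandas as pd

# Stand-in for the test split: two rows whose index labels are 3 and 5.
test_features = pd.DataFrame({'a': [0.5, 0.7]}, index=[3, 5])
test_outcome = [0, 1]  # actual class labels, not row labels

# The second positional argument is `index`: pandas looks up rows labelled
# 0 and 1, finds neither in test_features, and fills the result with NaN.
broken = pd.DataFrame(test_features, test_outcome)

# Attaching the outcome as a column keeps the original rows intact.
fixed = test_features.copy()
fixed['Tag'] = test_outcome

print(broken)
print(fixed)
```

The same reasoning explains the single NaN row labelled 1 in the question: the test set held one row whose outcome was 1, and no test row carried the index label 1.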

Upvotes: 1
