gimbaptopokki

Reputation: 31

ColumnTransformer and train_test_split process

I'm currently learning scikit-learn (please don't scold me), and I'm a bit confused about how ColumnTransformer, training and prediction fit together. I have a data set with features such as Gender, Married, Graduation status, Loan Amount, Income, etc. It mixes object (string) columns and integer columns, but the majority are objects. From my understanding, I need to convert the object columns to numeric values before training a model, and I do so using ColumnTransformer. What confuses me is how this fits into training the model. This is my current code:

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

df = pd.read_csv("loan_data.csv", sep=",")
df.replace("", np.nan, inplace=True)
df.dropna(inplace=True)
df = df.drop(columns=["Loan_ID"])

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

loan_categories = ["Gender", "Married", "Dependents", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
ohe = OneHotEncoder()

ct = make_column_transformer(
    (ohe, loan_categories),
    remainder="passthrough")

ct.fit_transform(X)

And then comes my confusion with train_test_split. Am I supposed to train_test_split before passing X to fit_transform, or would this happen now after I have defined ct?

The rest of my code would look something like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)

Upvotes: 3

Views: 829

Answers (1)

Ilya

Reputation: 486

Hi, if you want to use fit_transform, try this:

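    from sklearn.feature_extraction.text import CountVectorizer

    X = df.drop(columns=["LoanAmount"])
    y = df["LoanAmount"]

    # CountVectorizer expects an iterable of strings, so join each row into one string.
    docs = X.astype(str).agg(" ".join, axis=1)
    cv = CountVectorizer(max_features=5000, ngram_range=(1, 128), min_df=2, analyzer="word")
    x = cv.fit_transform(docs).toarray()
    print("x.shape = ", x.shape)
    print("y.shape = ", y.shape)

    # Split the transformed array, not the raw DataFrame.
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)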

Am I supposed to call train_test_split before passing X to fit_transform? The answer is no: in the code above, fit_transform is called first and train_test_split afterwards.
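
For comparison, a common alternative is to split first and let a Pipeline fit the ColumnTransformer on the training rows only, so the test rows do not influence the fitted encoder. The following is a minimal sketch, not a definitive implementation: the toy DataFrame is made up and stands in for loan_data.csv, and handle_unknown="ignore" is an added assumption so that categories seen only in the test rows do not raise an error.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for loan_data.csv; all values are made up.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female", "Male",
               "Female", "Male", "Female", "Male", "Female"],
    "Married": ["Yes", "No", "Yes", "Yes", "No",
                "No", "Yes", "No", "Yes", "No"],
    "Income": [5000, 3000, 4000, 6000, 2500, 3500, 5200, 2800, 4100, 3900],
    "LoanAmount": [150, 100, 150, 200, 100, 100, 150, 100, 150, 100],
})

X = df.drop(columns=["LoanAmount"])
y = df["LoanAmount"]

# Split the raw data first, so the test rows stay unseen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the string columns; numeric columns pass through unchanged.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Gender", "Married"]),
    remainder="passthrough",
)

# The pipeline calls fit_transform on X_train during fit,
# and only transform on X_test during predict.
model = make_pipeline(ct, DecisionTreeClassifier(random_state=42))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
print(score)
```

With this layout there is no separate fit_transform call to place before or after the split: the pipeline applies the fitted transformer to whatever data is passed to fit or predict.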

Upvotes: 0
