Reputation: 5
I am new to programming and I was working with the titanic dataset from Kaggle. I have been trying to build the Logistic Regression model after performing one-hot encoding. But I keep getting the error. I think the error is caused due to the dummy variable. Below is my code.
import numpy as np
import pandas as pd
import matplotlib as plt
import seaborn as sns
#Loading data
df=pd.read_csv(r"C:\Users\Downloads\train.csv")
#Deleting unwanted columns
df.drop(["PassengerId","Name","Cabin","Ticket"],axis=1,inplace=True)
#COunt of Missing values in each column
print(df.isnull().sum())
#Deleting rows with missing values based on column name
df.dropna(subset=['Embarked','Age'],inplace=True)
print(df.isnull().sum())
#One hot encoding for categorical variables
#Creating dummy variables for Sex column
dummies = pd.get_dummies(df.Sex)
dummies2=pd.get_dummies(df.Embarked)
#Appending the dummies dataframe with original dataframe
new_df= pd.concat([df,dummies,dummies2],axis='columns')
print(type(new_df))
#print(new_df.head(10))
#Drop the original sex,Embarked column and one of the dummy column for bth variables
new_df.drop(['Sex','Embarked'],axis='columns',inplace=True)
print(new_df.head(10))
new_df.info()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
x = df.drop('Survived', axis=1)
y = df['Survived']
logmodel = LogisticRegression()
logmodel.fit(x, y)
Upvotes: 0
Views: 2324
Reputation: 12992
As we discussed in the comments, here is the solution:
First, you need to modify your x
and y
variables to use new_df
instead of df
just like so:
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
Then, you need to increase the iteration of your Logistic Regression Model like so:
logmodel = LogisticRegression(max_iter=1000)
Upvotes: 1