Reputation: 93
I am trying this with a sample DataFrame:
data = [['Alex','USA',0],['Bob','India',1],['Clarke','SriLanka',0]]
df = pd.DataFrame(data, columns=['Name','Country','Target'])
Now from here, I used get_dummies to convert string column to an integer:
column_names=['Name','Country']
one_hot = pd.get_dummies(df[column_names])
After conversion the columns are: Name_Alex, Name_Bob, Name_Clarke, Country_India, Country_SriLanka, Country_USA
x = one_hot[["Name_Alex","Name_Bob","Name_Clarke","Country_India","Country_SriLanka","Country_USA"]].values
y = df['Target'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5, random_state=0)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
Now the model is trained.
For prediction, let's say I want to predict the "Target" by giving a "Name" and "Country", like ["Alex","USA"].
If I use this:
logreg.predict([["Alex","USA"]])
it obviously will not work, because the model was trained on the one-hot encoded columns, not on the raw strings.
Upvotes: 5
Views: 10323
Reputation: 25
If the code in the other answer raises a TypeError (which can happen when X is a pandas DataFrame rather than a NumPy array), use this version, which indexes with iloc:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

labelencoder_dict = {}
onehotencoder_dict = {}
X_train = None
for i in range(X.shape[1]):
    # Fit one LabelEncoder and one OneHotEncoder per column, and keep them
    # so the same encoding can be reapplied at prediction time.
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X.iloc[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)  # `sparse` is named `sparse_output` in scikit-learn >= 1.2
    feature = onehot_encoder.fit_transform(feature)
    onehotencoder_dict[i] = onehot_encoder
    if X_train is None:
        X_train = feature
    else:
        X_train = np.concatenate((X_train, feature), axis=1)
def getEncoded(test_data, labelencoder_dict, onehotencoder_dict):
    test_encoded_x = None
    for i in range(test_data.shape[1]):
        label_encoder = labelencoder_dict[i]
        feature = label_encoder.transform(test_data.iloc[:, i])
        feature = feature.reshape(test_data.shape[0], 1)
        onehot_encoder = onehotencoder_dict[i]
        feature = onehot_encoder.transform(feature)
        if test_encoded_x is None:
            test_encoded_x = feature
        else:
            test_encoded_x = np.concatenate((test_encoded_x, feature), axis=1)
    return test_encoded_x
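For completeness, here is a hedged end-to-end sketch of how this iloc variant might be wired up for the question's prediction problem. The sample DataFrame and column names are taken from the question; the transform loop mirrors getEncoded, and the `make_onehot` helper (which tries both spellings of the `sparse` keyword across scikit-learn versions) is my own addition, not part of the answer:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def make_onehot():
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; try both.
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)

df = pd.DataFrame([['Alex', 'USA', 0], ['Bob', 'India', 1], ['Clarke', 'SriLanka', 0]],
                  columns=['Name', 'Country', 'Target'])
X = df[['Name', 'Country']]
y = df['Target'].values

# Fit one LabelEncoder + OneHotEncoder per column and keep them in dicts.
labelencoder_dict = {}
onehotencoder_dict = {}
X_train = None
for i in range(X.shape[1]):
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X.iloc[:, i]).reshape(X.shape[0], 1)
    onehot_encoder = make_onehot()
    onehotencoder_dict[i] = onehot_encoder
    feature = onehot_encoder.fit_transform(feature)
    X_train = feature if X_train is None else np.concatenate((X_train, feature), axis=1)

logreg = LogisticRegression()
logreg.fit(X_train, y)

# Encode a new ("Name", "Country") pair with the *same* encoders, then predict.
new_row = pd.DataFrame([['Alex', 'USA']], columns=['Name', 'Country'])
encoded = None
for i in range(new_row.shape[1]):
    feature = labelencoder_dict[i].transform(new_row.iloc[:, i]).reshape(new_row.shape[0], 1)
    feature = onehotencoder_dict[i].transform(feature)
    encoded = feature if encoded is None else np.concatenate((encoded, feature), axis=1)
prediction = logreg.predict(encoded)
```

This is exactly what the question was missing: the new sample is pushed through the fitted encoders first, so the model sees a 6-column one-hot vector rather than raw strings.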
Upvotes: 0
Reputation: 152
I suggest using scikit-learn's LabelEncoder and OneHotEncoder instead of pd.get_dummies.
Once you fit a label encoder and a one-hot encoder per feature, save them somewhere, so that when you want to predict on new data you can load the saved encoders and encode its features with them.
This way you encode the new features in exactly the same way as you did for the training set.
Below is the code I use to build the encoders:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

labelencoder_dict = {}
onehotencoder_dict = {}
X_train = None
for i in range(X.shape[1]):
    # Fit one LabelEncoder and one OneHotEncoder per column, and keep them
    # so the same encoding can be reapplied at prediction time.
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X[:, i])
    feature = feature.reshape(X.shape[0], 1)
    onehot_encoder = OneHotEncoder(sparse=False)  # `sparse` is named `sparse_output` in scikit-learn >= 1.2
    feature = onehot_encoder.fit_transform(feature)
    onehotencoder_dict[i] = onehot_encoder
    if X_train is None:
        X_train = feature
    else:
        X_train = np.concatenate((X_train, feature), axis=1)
Now I save this onehotencoder_dict and labelencoder_dict (e.g. with pickle) and use them later for encoding new data:
def getEncoded(test_data, labelencoder_dict, onehotencoder_dict):
    test_encoded_x = None
    for i in range(test_data.shape[1]):
        label_encoder = labelencoder_dict[i]
        feature = label_encoder.transform(test_data[:, i])
        feature = feature.reshape(test_data.shape[0], 1)
        onehot_encoder = onehotencoder_dict[i]
        feature = onehot_encoder.transform(feature)
        if test_encoded_x is None:
            test_encoded_x = feature
        else:
            test_encoded_x = np.concatenate((test_encoded_x, feature), axis=1)
    return test_encoded_x
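The answer says to save the encoder dicts but doesn't show how; below is a minimal sketch of one way to do it with pickle. The sample data, the `encoders.pkl` file name, and the `make_onehot` helper (handling the `sparse`/`sparse_output` rename across scikit-learn versions) are my illustrative assumptions, not part of the answer:

```python
import pickle

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

def make_onehot():
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; try both.
    try:
        return OneHotEncoder(sparse_output=False)
    except TypeError:
        return OneHotEncoder(sparse=False)

X = np.array([['Alex', 'USA'], ['Bob', 'India'], ['Clarke', 'SriLanka']])

# Fit the per-column encoders as in the answer above.
labelencoder_dict = {}
onehotencoder_dict = {}
for i in range(X.shape[1]):
    label_encoder = LabelEncoder()
    labelencoder_dict[i] = label_encoder
    feature = label_encoder.fit_transform(X[:, i]).reshape(X.shape[0], 1)
    onehot_encoder = make_onehot()
    onehot_encoder.fit(feature)
    onehotencoder_dict[i] = onehot_encoder

# Persist both dicts in one file ...
with open('encoders.pkl', 'wb') as f:
    pickle.dump((labelencoder_dict, onehotencoder_dict), f)

# ... and load them back at prediction time, possibly in another process.
with open('encoders.pkl', 'rb') as f:
    labelencoder_dict, onehotencoder_dict = pickle.load(f)

# Encode a new sample with the restored encoders (same logic as getEncoded).
test_data = np.array([['Alex', 'USA']])
encoded = None
for i in range(test_data.shape[1]):
    feature = labelencoder_dict[i].transform(test_data[:, i]).reshape(test_data.shape[0], 1)
    feature = onehotencoder_dict[i].transform(feature)
    encoded = feature if encoded is None else np.concatenate((encoded, feature), axis=1)
```

joblib.dump/joblib.load would work just as well and is what the scikit-learn docs recommend for persisting fitted estimators.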
Upvotes: 6