I've been trying to run some ML code, but it keeps failing at the fitting stage of my pipeline. I've looked around on various forums without much luck. What I have found is that some people say you can't use LabelEncoder within a pipeline, but I'm not sure how true that is. If anyone has any insights on the matter, I'd be very happy to hear them.
I keep getting this error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
So I'm not sure if the problem is with my code or with Python. Here's my code:
import pandas as pd
from sklearn.preprocessing import (LabelEncoder, OrdinalEncoder,
                                   OneHotEncoder, StandardScaler)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from category_encoders import TargetEncoder, CatBoostEncoder

data = pd.read_csv("ks-projects-201801.csv",
                   index_col="ID",
                   parse_dates=["deadline","launched"],
                   infer_datetime_format=True)
var = list(data)
# Drop a handful of rows by their ID index labels
data = data.drop(labels=[1014746686, 1245461087, 1384087152, 1480763647, 330942060, 462917959, 69489148])
missing = [i for i in var if data[i].isnull().any()]
data = data.dropna(subset=missing,axis=0)
le = LabelEncoder()
oe = OrdinalEncoder()
oh = OneHotEncoder()
y = data[var.pop(8)]    # column 8 of var is "state", the target; pop removes it from var
p = pd.Series(le.fit_transform(y), index=y.index)
q = pd.read_csv("y.csv",index_col="ID")["0"]
label_y = le.fit_transform(y)
x = data[var]
obj_feat = x.select_dtypes(include="object")
dat_feat = x.select_dtypes(include="datetime64[ns]")
dat_feat = dat_feat.assign(dmonth=dat_feat.deadline.dt.month.astype("int64"),
                           dyear=dat_feat.deadline.dt.year.astype("int64"),
                           lmonth=dat_feat.launched.dt.month.astype("int64"),
                           lyear=dat_feat.launched.dt.year.astype("int64"))
dat_feat = dat_feat.drop(labels=["deadline","launched"],axis=1)
num_feat = x.select_dtypes(include=["int64","float64"])
# Cardinality of each object column, used to pick an encoder per column
u = dict(zip(list(obj_feat),[len(obj_feat[i].unique()) for i in obj_feat]))
le_obj = [i for i in u if u[i]<10]
oh_obj = [i for i in u if u[i]<20 and u[i]>10]
te_obj = [i for i in u if u[i]>20 and u[i]<25]
cb_obj = [i for i in u if u[i]>100]
# Pipeline time
# Impute and encode
strat = ["constant","most_frequent","mean","median"]
sc = StandardScaler()
oh_unk = "ignore"
encoders = [LabelEncoder(),
            OneHotEncoder(handle_unknown=oh_unk),
            TargetEncoder(),
            CatBoostEncoder()]
#num_trans = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[2])),
num_trans = Pipeline(steps=[("sc",sc)])
#obj_imp = Pipeline(steps=[("imp",SimpleImputer(strategy=strat[1]))])
oh_enc = Pipeline(steps=[("oh_enc",encoders[1])])
te_enc = Pipeline(steps=[("te_enc",encoders[2])])
cb_enc = Pipeline(steps=[("cb_enc",encoders[0])])
trans = ColumnTransformer(transformers=[
    ("num",num_trans,list(num_feat)+list(dat_feat)),
    #("obj",obj_imp,list(obj_feat)),
    ("onehot",oh_enc,oh_obj),
    ("target",te_enc,te_obj),
    ("catboost",cb_enc,cb_obj)
    ])
models = [RandomForestClassifier(random_state=0),
          KNeighborsClassifier(),
          DecisionTreeClassifier(random_state=0)]
model = models[2]
print("Check 4")
# Chaining it all together
run = Pipeline(steps=[("Transformation",trans),("Model",model)])
x = pd.concat([obj_feat,dat_feat,num_feat],axis=1)
print("Check 5")
run.fit(x,p)
It runs fine until run.fit, where it throws the error. I'd love to hear any advice anyone might have, and any possible way to resolve this problem would be greatly appreciated! Thank you.
Answer:
The problem is the same as the one spotted in this answer, but with a LabelEncoder in your case. The LabelEncoder's fit_transform method takes:
def fit_transform(self, y):
    """Fit label encoder and return encoded labels
    ...
Whereas Pipeline expects all of its transformers to take three positional arguments: fit_transform(self, X, y).
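You can reproduce the mismatch outside of any pipeline; here's a minimal sketch with toy data:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(["a", "b", "a"])        # fine: one data argument
# Inside a Pipeline, each step is called as fit_transform(X, y),
# which for LabelEncoder boils down to:
le.fit_transform(["a", "b", "a"], [0, 1, 0])
# TypeError: fit_transform() takes 2 positional arguments but 3 were given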
You could make a custom transformer as in the aforementioned answer; however, a LabelEncoder should not be used as a feature transformer in the first place. An extensive explanation of why can be found in LabelEncoder for categorical features?. So I'd recommend not using a LabelEncoder and instead reaching for one of the Bayesian encoders when a feature's cardinality gets too high, such as the TargetEncoder, which you also have in your list of encoders.
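Concretely, in your posted code the "catboost" branch of the ColumnTransformer is wired to encoders[0], which is the LabelEncoder, so that's where the error enters the pipeline. A minimal sketch of the fix, reusing the encoders you already defined (index 3 being the CatBoostEncoder):

# encoders = [LabelEncoder(), OneHotEncoder(handle_unknown=oh_unk),
#             TargetEncoder(), CatBoostEncoder()]
cb_enc = Pipeline(steps=[("cb_enc", encoders[3])])  # CatBoostEncoder, not LabelEncoder

With that change, every transformer in the ColumnTransformer accepts fit_transform(X, y), and run.fit(x, p) should get past this error.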