Reputation: 324
I have a pandas DataFrame that contains floats, dates, integers, and classes. Given the sheer number of columns, what would be the most automated way to select the columns that need it (mainly the ones that are classes) and then label encode them?
FYI: Dates must not be label encoded.
Upvotes: 1
Views: 447
Reputation: 725
Try this -
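# To select numerical and categorical columns (X_train is your DataFrame)
# include="number" keeps date columns out of num_cols; exclude="object" would let them in
num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = X_train.select_dtypes(include="object").columns.tolist()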
# you can also pass a list like -
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()
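After that, you can build a preprocessing pipeline like this (note that it one-hot encodes the categorical columns rather than label encoding them) -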
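from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numerical data preprocessing pipeline
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# categorical data preprocessing pipeline
# (scikit-learn >= 1.2 names the argument sparse_output; older versions use sparse)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)
# full pipeline; columns in neither list are dropped by default (remainder="drop")
full_pipe = ColumnTransformer(
    [("num", num_pipe, num_cols), ("cat", cat_pipe, cat_cols)]
)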
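Since the question asks for label encoding rather than one-hot encoding, here is a minimal sketch of the same column selection feeding scikit-learn's OrdinalEncoder, which is the feature-side counterpart of LabelEncoder (LabelEncoder is meant for target labels). The small X_train frame and its column names are made up for illustration, and mapping unseen categories to -1 is just one possible choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for X_train; the column names are made up.
X_train = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "order_date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "shirt_size": pd.Categorical(["S", "M", "S"]),
})

# Dates have a datetime64 dtype, so they are not picked up here.
cat_cols = X_train.select_dtypes(include=["object", "category"]).columns.tolist()

# Categories unseen at transform time are mapped to -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
codes = pd.DataFrame(
    encoder.fit_transform(X_train[cat_cols]),
    columns=cat_cols,
    index=X_train.index,
)

# Swap the encoded columns back in; floats, integers and dates stay untouched.
X_encoded = X_train.drop(columns=cat_cols).join(codes)
print(X_encoded)
```

Keeping the fitted encoder around lets you apply the same mapping to test data with encoder.transform, and encoder.inverse_transform recovers the original labels.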
Upvotes: 3
Reputation: 120391
You can use select_dtypes to select columns by data type, or filter to select columns by name.
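A rough sketch of both approaches with pandas only, followed by one possible way to label encode the selected columns through category codes. The frame and its column names are hypothetical, and using cat.codes is an assumption about what kind of label encoding is wanted.

```python
import pandas as pd

# Hypothetical example frame; the column names are made up for illustration.
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "qty": [1, 2, 3],
    "order_date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "color": pd.Series(["red", "blue", "red"], dtype=object),
    "color_group": pd.Categorical(["warm", "cool", "warm"]),
})

# By dtype: object and category columns; datetime columns are not matched.
by_dtype = df.select_dtypes(include=["object", "category"]).columns.tolist()

# By name: columns whose label contains the substring "color".
by_name = df.filter(like="color").columns.tolist()

# Label encode the selected columns with integer category codes.
# Missing values, if any, get the code -1.
encoded = df.copy()
for col in by_dtype:
    encoded[col] = df[col].astype("category").cat.codes

print(encoded.dtypes)
```

filter also accepts items= for an explicit list of names and regex= for a pattern, which helps when the class columns share a naming scheme.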
Upvotes: 1