Reputation: 167
I have a CSV file with 25 columns: some are numeric, some are categorical, and some contain names such as actors and directors. I want to fit regression models on this data. To do so, I have to convert the categorical string columns to numeric values using LabelBinarizer from the scikit-learn package. How can I use LabelBinarizer on a dataframe that has multiple categorical columns?
Essentially I want to binarize the labels and add them to the dataframe.
In the code below, I have the list of columns I want to binarize, but I cannot figure out how to add the new columns back to the df.
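from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
categorylist = ['color', 'language', 'country', 'content_rating']
for col in categorylist:
    # fit_transform returns a NumPy array, not a DataFrame
    tempdf = label_binarizer.fit_transform(df[col])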
In the next step, I want to add tempdf to df and drop the original column df[col].
Upvotes: 6
Views: 7629
Reputation: 36555
You can do this in a one-liner with pd.get_dummies:
tempdf = pd.get_dummies(df, columns=categorylist)
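To make the one-liner concrete, here is a minimal sketch on a toy dataframe. The sample rows and the `budget` column are made up for illustration, and `dtype=int` is an optional extra so the indicator columns come out as 0/1 integers rather than booleans on newer pandas versions.

```python
import pandas as pd

# Toy stand-in for the real CSV; the values are invented for illustration.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Spanish"],
    "budget": [100.0, 50.0, 75.0],
})
categorylist = ["color", "language"]

# Each listed column is replaced by one indicator column per category,
# named "<column>_<category>"; the other columns are left untouched.
encoded = pd.get_dummies(df, columns=categorylist, dtype=int)

print(encoded.columns.tolist())
```

The original categorical columns are dropped automatically, so no separate drop step is needed. For a linear regression you may also want to look at the `drop_first=True` option, which omits one indicator per column to avoid perfectly collinear features.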
Otherwise, you can use a FeatureUnion with FunctionTransformer, as in the answer to "sklearn pipeline - how to apply different transformations on different columns".
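A rough sketch of the FeatureUnion idea follows. This is not the code from the linked answer: the toy data, the helper functions `select_numeric`, `select_categorical` and `to_dense`, and the choice of OneHotEncoder for the categorical branch are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Toy stand-in for the real CSV; the values are invented for illustration.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Spanish"],
    "budget": [100.0, 50.0, 75.0],
})
categorylist = ["color", "language"]
numeric_cols = ["budget"]  # hypothetical numeric column


def select_numeric(X):
    # Pass the numeric columns through unchanged, as a NumPy array.
    return X[numeric_cols].to_numpy()


def select_categorical(X):
    # Hand only the categorical columns to the encoder.
    return X[categorylist]


def to_dense(X):
    # OneHotEncoder returns a sparse matrix by default; densify it.
    return X.toarray()


union = FeatureUnion([
    ("numeric", FunctionTransformer(select_numeric, validate=False)),
    ("categorical", make_pipeline(
        FunctionTransformer(select_categorical, validate=False),
        OneHotEncoder(),
        FunctionTransformer(to_dense, validate=False),
    )),
])

# Columns: budget, then one indicator per category of each encoded column.
features = union.fit_transform(df)
print(features.shape)
```

The result is a plain NumPy array rather than a dataframe, so column names are lost. The upside is that the fitted `union` can be reused with `transform` on new data, which get_dummies cannot do on its own.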
EDIT: As added by @dukebody in the comments, you can also use the sklearn-pandas package, whose purpose is to apply different transformations to each dataframe column.
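If you would rather stay with LabelBinarizer as in the question, a loop along these lines should work. It is only a sketch on made-up data, and the `<column>_<class>` naming scheme is my own choice, not something LabelBinarizer provides.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Toy stand-in for the real CSV; the values are invented for illustration.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Spanish"],
    "budget": [100.0, 50.0, 75.0],
})
categorylist = ["color", "language"]

for col in categorylist:
    lb = LabelBinarizer()
    encoded = lb.fit_transform(df[col])  # NumPy array of 0/1 values

    if encoded.shape[1] == 1:
        # With exactly two classes LabelBinarizer emits a single column,
        # where 1 marks the second entry of classes_.
        # (A column with only one distinct value would need special handling.)
        names = [f"{col}_{lb.classes_[1]}"]
    else:
        names = [f"{col}_{c}" for c in lb.classes_]

    # Wrap the array in a DataFrame that shares df's index, then swap it in
    # for the original column.
    encoded_df = pd.DataFrame(encoded, columns=names, index=df.index)
    df = pd.concat([df.drop(columns=col), encoded_df], axis=1)

print(df.columns.tolist())
```

Passing `index=df.index` matters when the dataframe does not have a default 0..n-1 index; without it, `pd.concat` would align on mismatched labels and introduce NaN rows.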
Upvotes: 8