Reputation: 2033
I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have used OneHotEncoder to encode all the categorical variables in my dataset:
# Encode categorical data with OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
Z = ohe.fit_transform(Z)
I now want to create a dataframe from the OneHotEncoder results, with the new categories produced by the encoding as its columns, which is why I am using the categories_ attribute. When running the following line of code:
ohe_df = pd.DataFrame(Z, columns=ohe.categories_)
I get the error: ValueError: all arrays must be same length
I understand that the arrays being referred to in the error message are the arrays of categories, each of which has a different length depending on the number of categories it contains, but am not sure what the correct way of creating a dataframe with the new categories as columns is (when there are multiple features).
I tried to do this with a small dataset that contained one feature only and it worked:
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
df = pd.DataFrame(['Male', 'Female', 'Female'])
results = ohe.fit_transform(df)
ohe_df = pd.DataFrame(results, columns=ohe.categories_)
ohe_df.head()
   Female  Male
0     0.0   1.0
1     1.0   0.0
2     1.0   0.0
So how do I do the same for my large dataset, which has numerous features?
Thank you in advance.
EDIT:
As requested, I have come up with a MWE to demonstrate how it is not working:
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
                            ['Female', 'missing', 'None'], ['Male', 'Yes', 'Ventouse']]),
                  columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
print(results)
By this step, I have created a dataframe of categorical data and encoded it. I now want to create another dataframe such that the columns of the new dataframe are the categories created by the OneHotEncoder and rows are the encoded data. To do this I tried two things:
ohe_df = pd.DataFrame(results, columns=np.concatenate(ohe.categories_))
And I tried:
ohe_df = pd.DataFrame(results, columns=ohe.get_feature_names(input_features=df.columns))
Both attempts resulted in the error: ValueError: Shape of passed values is (4, 1), indices imply (4, 9)
Upvotes: 0
Views: 829
Reputation: 153460
IIUC,
import numpy as np
import pandas as pd
# create dataframe
df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
                            ['Female', 'missing', 'None'], ['Male', 'Yes', 'Ventouse']]),
                  columns=['gender', 'diabetes', 'assistance'])
df.head()
# encode categorical data
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)
df_results = pd.DataFrame.sparse.from_spmatrix(results)
df_results.columns = ohe.get_feature_names(df.columns)
df_results
Output:
   gender_Female  gender_Male  diabetes_No  diabetes_Yes  diabetes_missing  assistance_Forceps  assistance_Forceps and ventouse  assistance_None  assistance_Ventouse
0            0.0          1.0          0.0           1.0               0.0                 1.0                              0.0              0.0                  0.0
1            1.0          0.0          1.0           0.0               0.0                 0.0                              1.0              0.0                  0.0
2            1.0          0.0          0.0           0.0               1.0                 0.0                              0.0              1.0                  0.0
3            0.0          1.0          0.0           1.0               0.0                 0.0                              0.0              0.0                  1.0
Note that the output of ohe.fit_transform(df) is a sparse matrix:
print(type(results))
<class 'scipy.sparse.csr.csr_matrix'>
You can convert this to a dataframe using pd.DataFrame.sparse.from_spmatrix. Then use ohe.get_feature_names, passing the original dataframe's columns, to name the columns of the results dataframe, df_results.
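If an ordinary dense dataframe is preferred over sparse columns, the same idea can be sketched as below. This is a hedged variant, not part of the code above: it assumes the encoded matrix is small enough to densify with .toarray(), and it assumes a scikit-learn version where get_feature_names may have been replaced by get_feature_names_out (added in 1.0; the older name was removed in 1.2), so it checks for the newer method first. The names names and dense_df are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question
df = pd.DataFrame(
    [['Male', 'Yes', 'Forceps'],
     ['Female', 'No', 'Forceps and ventouse'],
     ['Female', 'missing', 'None'],
     ['Male', 'Yes', 'Ventouse']],
    columns=['gender', 'diabetes', 'assistance'],
)

ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)  # SciPy sparse matrix by default

# Assumption: newer scikit-learn exposes get_feature_names_out;
# fall back to get_feature_names on older installations.
if hasattr(ohe, 'get_feature_names_out'):
    names = ohe.get_feature_names_out(list(df.columns))
else:
    names = ohe.get_feature_names(list(df.columns))

# Densify before handing the data to the plain DataFrame constructor
dense_df = pd.DataFrame(results.toarray(), columns=names, index=df.index)
print(dense_df.shape)
```

Densifying costs memory on wide encodings, so from_spmatrix is likely the better choice for a large dataset with many categories.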
Upvotes: 4
Reputation: 12592
ohe.categories_ is a list of arrays, one array for each feature. You need to flatten that into a 1D list/array for pd.DataFrame, e.g. with np.concatenate(ohe.categories_).
But it is probably better to use the built-in method get_feature_names.
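A minimal sketch of the flattening approach, using the data from the question. One assumption goes beyond what is stated above: the encoder returns a sparse matrix by default, which is likely why the question's np.concatenate attempt still failed with a shape error, so the sketch calls .toarray() before building the dataframe. The names flat and ohe_df are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [['Male', 'Yes', 'Forceps'],
     ['Female', 'No', 'Forceps and ventouse'],
     ['Female', 'missing', 'None'],
     ['Male', 'Yes', 'Ventouse']],
    columns=['gender', 'diabetes', 'assistance'],
)

ohe = OneHotEncoder(handle_unknown='ignore')
results = ohe.fit_transform(df)

# categories_ holds one array per input feature, each of a different length
print([len(c) for c in ohe.categories_])

# Flatten the per-feature arrays into a single 1D array of column labels
flat = np.concatenate(ohe.categories_)

# Assumption: convert the sparse result to a dense array first
ohe_df = pd.DataFrame(results.toarray(), columns=flat)
print(ohe_df.columns.tolist())
```

The flattened labels carry no feature prefix, so two features that share a category value (for example two Yes/No columns) would produce duplicate column names. The built-in feature-name method avoids that.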
Upvotes: 3