sums22

Reputation: 2033

ValueError when attempting to create dataframe with OneHotEncoder results

I have recently started learning Python to develop a predictive model for a research project using machine learning methods. I have used OneHotEncoder to encode all the categorical variables in my dataset:

    # Encode categorical data with oneHotEncoder
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(handle_unknown='ignore')
    Z = ohe.fit_transform(Z)

I now want to create a dataframe from the OneHotEncoder results. I want the dataframe columns to be the new categories produced by the encoding, which is why I am using the categories_ attribute. When running the following line of code:

    ohe_df = pd.DataFrame(Z, columns=ohe.categories_)

I get the error: ValueError: all arrays must be same length

I understand that the arrays being referred to in the error message are the arrays of categories, each of which has a different length depending on the number of categories it contains, but am not sure what the correct way of creating a dataframe with the new categories as columns is (when there are multiple features).

I tried to do this with a small dataset that contained one feature only and it worked:

    ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
    df = pd.DataFrame(['Male', 'Female', 'Female'])
    results = ohe.fit_transform(df)

    ohe_df = pd.DataFrame(results, columns=ohe.categories_)
    ohe_df.head()
    
        Female  Male
    0   0.0     1.0
    1   1.0     0.0
    2   1.0     0.0

So how do I do the same for my large dataset with numerous features?

Thank you in advance.

EDIT:

As requested, I have come up with a MWE to demonstrate how it is not working:


    import numpy as np
    import pandas as pd
    
    # create dataframe 
    df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
                                 ['Female','missing','None'], ['Male','Yes','Ventouse']]), 
                      columns=['gender', 'diabetes', 'assistance'])
    
    df.head()
    
    # encode categorical data 
    from sklearn.preprocessing import OneHotEncoder
    
    ohe = OneHotEncoder(handle_unknown='ignore')
    results = ohe.fit_transform(df)
    print(results)

By this step, I have created a dataframe of categorical data and encoded it. I now want to create another dataframe such that the columns of the new dataframe are the categories created by the OneHotEncoder and rows are the encoded data. To do this I tried two things:

    ohe_df = pd.DataFrame(results, columns=np.concatenate(ohe.categories_))

And I tried:

    ohe_df = pd.DataFrame(results, columns=ohe.get_feature_names(input_features=df.columns))

Both resulted in the error: ValueError: Shape of passed values is (4, 1), indices imply (4, 9)

Upvotes: 0

Views: 829

Answers (2)

Scott Boston

Reputation: 153460

IIUC,

    import numpy as np
    import pandas as pd

    # create dataframe 
    df = pd.DataFrame(np.array([['Male', 'Yes', 'Forceps'], ['Female', 'No', 'Forceps and ventouse'],
                                ['Female','missing','None'], ['Male','Yes','Ventouse']]), 
                        columns=['gender', 'diabetes', 'assistance'])

    df.head()

    # encode categorical data 
    from sklearn.preprocessing import OneHotEncoder

    ohe = OneHotEncoder(handle_unknown='ignore')
    results = ohe.fit_transform(df)
    df_results = pd.DataFrame.sparse.from_spmatrix(results)
    df_results.columns = ohe.get_feature_names(df.columns)
    df_results

Output:

       gender_Female  gender_Male  diabetes_No  diabetes_Yes  diabetes_missing  assistance_Forceps  assistance_Forceps and ventouse  assistance_None  assistance_Ventouse
    0            0.0          1.0          0.0           1.0               0.0                 1.0                              0.0              0.0                  0.0
    1            1.0          0.0          1.0           0.0               0.0                 0.0                              1.0              0.0                  0.0
    2            1.0          0.0          0.0           0.0               1.0                 0.0                              0.0              1.0                  0.0
    3            0.0          1.0          0.0           1.0               0.0                 0.0                              0.0              0.0                  1.0

Note, the output of ohe.fit_transform(df) is a sparse matrix.

    print(type(results))
    <class 'scipy.sparse.csr.csr_matrix'>

You can convert this to a dataframe using pd.DataFrame.sparse.from_spmatrix. Then you can call ohe.get_feature_names, passing the original dataframe's columns, to name the columns of the results dataframe, df_results.
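One caveat for readers on newer scikit-learn: get_feature_names was deprecated in version 1.0 and later removed in favour of get_feature_names_out. The following is a minimal sketch of the same approach that picks whichever method is available. The hasattr check and the variable name `names` are my additions for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question
df = pd.DataFrame(
    [["Male", "Yes", "Forceps"],
     ["Female", "No", "Forceps and ventouse"],
     ["Female", "missing", "None"],
     ["Male", "Yes", "Ventouse"]],
    columns=["gender", "diabetes", "assistance"],
)

ohe = OneHotEncoder(handle_unknown="ignore")
results = ohe.fit_transform(df)  # SciPy sparse matrix by default

# Assumption: fall back to the old method name only on scikit-learn < 1.0
if hasattr(ohe, "get_feature_names_out"):
    names = ohe.get_feature_names_out(list(df.columns))
else:
    names = ohe.get_feature_names(list(df.columns))

# Build a sparse-backed dataframe and name the columns in one step
df_results = pd.DataFrame.sparse.from_spmatrix(results, columns=names)
print(df_results.columns.tolist())
```

Passing `columns=` directly to from_spmatrix is equivalent to assigning df_results.columns afterwards.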

Upvotes: 4

Ben Reiniger

Reputation: 12592

ohe.categories_ is a list of arrays, one array for each feature. You need to flatten that into a 1D list/array for pd.DataFrame, e.g. with np.concatenate(ohe.categories_).

Probably better, though, is to use the built-in method get_feature_names, which also prefixes each category with the name of its feature.
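Note that flattening alone does not explain the "(4, 1)" in the question's second error. That shape suggests pd.DataFrame received the sparse matrix as a single object rather than a 4x9 table. Below is a minimal sketch of the flattening route, assuming the encoded data is small enough to convert to a dense array with .toarray(); the names `dense` and `flat` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the question
df = pd.DataFrame(
    [["Male", "Yes", "Forceps"],
     ["Female", "No", "Forceps and ventouse"],
     ["Female", "missing", "None"],
     ["Male", "Yes", "Ventouse"]],
    columns=["gender", "diabetes", "assistance"],
)

ohe = OneHotEncoder(handle_unknown="ignore")
results = ohe.fit_transform(df)  # sparse matrix, one column per category

# Convert to a dense NumPy array so pd.DataFrame sees a 4x9 table
dense = results.toarray()

# categories_ holds one array per feature; join them into one 1D array
flat = np.concatenate(ohe.categories_)

ohe_df = pd.DataFrame(dense, columns=flat)
print(ohe_df.shape)
```

The flattened labels carry no feature prefix, so two features that share a category (for example, two Yes/No columns) would yield duplicate column names. That is one reason to prefer the feature-name method.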

Upvotes: 3
