user76595

Reputation: 377

Why does Pandas get_dummies function not also perform a 'pivot'?

I have a pandas dataframe that looks like this:

Customer      Product
   A           Table
   A           Chair
   A           Desk

and when I run the Pandas get_dummies function on Product, I get this:

Customer   Product_Table    Product_Chair    Product_Desk
   A             1                 0                0 
   A             0                 1                0
   A             0                 0                1

Is this the correct input for modeling? It seems that I'm feeding the model customer A's information three separate times. The first row says the customer has only a Table and no Chair or Desk, but in reality they have all three.

How does this affect the model? My gut tells me that this type of conversion should leave me with only one row per customer. Is that right? And if so, what did I do wrong, or what do I need to add, to eliminate the 'duplicate' rows?

Below is the code I'm using:

import pandas as pd

# Collect the categorical (object-dtype) columns to dummy, excluding the customer key
todummy_list = []
for col_name in sdf.columns:
    if sdf[col_name].dtype == 'object' and col_name != 'Customer':
        todummy_list.append(col_name)
print(todummy_list)


# Replace each listed column with its one-hot (dummy) columns
def dummy_df(df, todummy_list):
    for x in todummy_list:
        # Encode the column of the frame passed in, not the global sdf
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        # Drop by keyword; recent pandas versions no longer accept a positional axis
        df = df.drop(columns=x)
        df = pd.concat([df, dummies], axis=1)
    return df

sdf = dummy_df(sdf, todummy_list)

print(sdf.head(5))

Upvotes: 1

Views: 1116

Answers (2)

Sudeep Das

Reputation: 31

The list you created is empty. You need to populate it, for example:

todummy_list = ['age','sex','working-class']

Upvotes: 0

jpp

Reputation: 164773

To eliminate the "duplicate rows", you can just use pd.crosstab:

res = pd.crosstab(df['Customer'], df['Product'])

print(res)

Product   Chair  Desk  Table
Customer                    
A             1     1      1
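
Note that pd.crosstab counts rows, so a customer who appears with the same product on more than one row would get a value above 1. If you want strict 0/1 indicators, or want to stay with get_dummies, here is a minimal sketch of two options. The sample frame, including customer B, is made up for illustration and is not from the question:

```python
import pandas as pd

# Hypothetical sample data; customer B is added only to show a second row
df = pd.DataFrame({
    "Customer": ["A", "A", "A", "B"],
    "Product": ["Table", "Chair", "Desk", "Chair"],
})

# Option 1: one-hot encode Product, then collapse to one row per customer.
# Taking the max per group keeps a 1 wherever the customer has that product.
dummies = pd.get_dummies(df["Product"], prefix="Product", dtype=int)
per_customer = dummies.groupby(df["Customer"]).max()

# Option 2: crosstab gives counts; clip caps them at 1 to get indicators.
indicators = pd.crosstab(df["Customer"], df["Product"]).clip(upper=1)

print(per_customer)
print(indicators)
```

Either way you end up with one row per customer, which is usually what you want when the customer is the unit being modeled.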

Upvotes: 1
