Reputation: 377
I have a pandas dataframe that looks like this:
Customer  Product
A         Table
A         Chair
A         Desk
and when I run the Pandas get_dummies function on Product, I get this:
Customer  Product_Table  Product_Chair  Product_Desk
A         1              0              0
A         0              1              0
A         0              0              1
Is this correct as a pre-modeling step? It seems that I'm feeding the model customer A's information three separate times. The first row says the customer has only a Table and no Chair or Desk, but in reality they have all three.
How does this affect the model? My gut tells me that when I do this type of conversion I should end up with only 1 line? Is that right? And if so, what did I do wrong, or need to add, in order to eliminate the 'duplicate' rows?
Below is the syntax I'm using:
import pandas as pd

# Create a list of features to dummy: every object-typed column except the key
todummy_list = []
for col_name in sdf.columns:
    if sdf[col_name].dtype == 'object' and col_name != 'Customer':
        todummy_list.append(col_name)
print(todummy_list)

# Function to dummy all the categorical variables used for modeling
def dummy_df(df, todummy_list):
    for x in todummy_list:
        # Encode from the df passed in, not the global sdf
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        # drop(x, 1) passes axis positionally, which pandas 2.0+ rejects
        df = df.drop(columns=x)
        df = pd.concat([df, dummies], axis=1)
    return df

sdf = dummy_df(sdf, todummy_list)
print(sdf.head(5))
Upvotes: 1
Views: 1116
Reputation: 31
The list you created is empty. You need to fill it, for example:
todummy_list = ['age','sex','working-class']
Upvotes: 0
Reputation: 164773
To eliminate the "duplicate rows", you can just use pd.crosstab:
res = pd.crosstab(df['Customer'], df['Product'])
print(res)
Product   Chair  Desk  Table
Customer
A             1     1      1
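One thing to keep in mind is that crosstab returns counts, not 0/1 flags, so a customer with the same product on two rows gets a 2. Below is a rough sketch of how that could be handled, and of an equivalent route that keeps the get_dummies column names. The second customer "B" and the variable names counts, indicators and per_customer are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Toy data mirroring the question; customer "B" is hypothetical and
# repeats a product on purpose to show the counting behaviour.
df = pd.DataFrame({
    "Customer": ["A", "A", "A", "B", "B"],
    "Product": ["Table", "Chair", "Desk", "Table", "Table"],
})

# crosstab counts occurrences, so B's repeated Table shows up as 2.
counts = pd.crosstab(df["Customer"], df["Product"])

# Clip to 0/1 if the model should see presence/absence rather than counts.
indicators = counts.clip(upper=1)

# Alternative: dummy-encode first, then collapse to one row per customer.
# max() keeps a 1 wherever the customer has the product on any row.
dummies = pd.get_dummies(df, columns=["Product"], dtype=int)
per_customer = dummies.groupby("Customer").max()

print(counts)
print(indicators)
print(per_customer)
```

Whether counts or 0/1 indicators are the better input depends on the model and on what a repeated row means in your data, so treat the clip step as optional.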
Upvotes: 1