cryp

Reputation: 2385

Python pandas get_dummies() limitation: doesn't convert all columns

I have 6 columns in my dataframe. Two of them have about 3K unique values each. When I use get_dummies() on the entire dataframe, or on just one of those two columns, what gets returned is the exact same column with its 3K values: get_dummies fails to dummify the bigger columns. Some columns do get one-hot encoded, but the big ones don't.

I wonder if get_dummies only works on columns with low cardinality.
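For what it's worth, one possible cause (not confirmed by the question, and the data below is hypothetical): by default pd.get_dummies only encodes object/category columns, so a column with a numeric dtype is passed through unchanged unless it is named explicitly via the `columns=` parameter.

```python
import pandas as pd

# Hypothetical frame: one string column, one integer column
df = pd.DataFrame({"city": ["NY", "LA", "NY"], "code": [1, 2, 3]})

# Default: only the object-dtype column is dummified; "code" passes through as-is
out = pd.get_dummies(df)
print(sorted(out.columns))  # ['city_LA', 'city_NY', 'code']

# Naming the numeric column in columns= forces it to be encoded too
out2 = pd.get_dummies(df, columns=["city", "code"])
print(sorted(out2.columns))  # ['city_LA', 'city_NY', 'code_1', 'code_2', 'code_3']
```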

I believe this was also discussed here: Need help with python(pandas) script

Upvotes: 4

Views: 4853

Answers (1)

piRSquared

Reputation: 294488

It appears to work as intended for me.

Consider the series s of random 3-character strings:

import pandas as pd
import numpy as np
from string import ascii_lowercase  # named `lowercase` on Python 2

np.random.seed([3, 1415])
# 10,000 rows of 3 random letters, concatenated row-wise into 3-character strings
s = pd.DataFrame(np.random.choice(list(ascii_lowercase), (10000, 3))).sum(1)

s.nunique()

7583

Then build the dummy dataframe df:

df = s.str.get_dummies()

df.shape

(10000, 7583)

df.sum(1).describe()

count    10000.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
dtype: float64
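As a quick sketch with a small hypothetical series: the top-level pd.get_dummies and the .str accessor used above agree on the result when each whole string is one category (the only caveat being that str.get_dummies additionally splits on "|" by default).

```python
import pandas as pd

# Hypothetical small series of strings with no "|" separators
s = pd.Series(["abc", "abd", "abc", "xyz"])

# Top-level function vs. the .str accessor: same columns, same indicators
a = pd.get_dummies(s)       # one column per unique value
b = s.str.get_dummies()     # same here, since no string contains "|"
print(a.shape, b.shape)     # (4, 3) (4, 3)
```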

Upvotes: 4
