Reputation: 182
I have data like this:
|--------|---------|
| Col1 | Col2 |
|--------|---------|
| X | a,b,c |
|--------|---------|
| Y | a,b |
|--------|---------|
| X | b,d |
|--------|---------|
I want to convert these categorical columns to dummy variables. Since my data is large, pandas' get_dummies() raises a MemoryError.
I want my result to look like this:
|------|------|------|------|------|------|
|Col_X |Col_Y |Col2_a|Col2_b|Col2_c|Col2_d|
|------|------|------|------|------|------|
| 1 | 0 | 1 | 1 | 1 | 0 |
|------|------|------|------|------|------|
| 0 | 1 | 1 | 1 | 0 | 0 |
|------|------|------|------|------|------|
| 1 | 0 | 0 | 1 | 0 | 1 |
|------|------|------|------|------|------|
I have tried converting Col2 this way, but I get a MemoryError because the data is large and Col2 has high cardinality.
So:
1) How can I convert multiple categorical columns into dummy variables?
2) pandas get_dummies() is raising a MemoryError; how can I handle that?
Upvotes: 1
Views: 3126
Reputation: 182
I would like to share my solution as well, and to thank @James-dellinger for his answer. Here is my approach:
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df
Col1 Col2
0 X a,b,c
1 Y a,b
2 X b,d
First, I split the Col2 values into separate columns:
df = pd.DataFrame(df['Col2'].str.split(',', 3).tolist(), columns=['Col1', 'Col2', 'Col3'])
df
Col1 Col2 Col3
0 a b c
1 a b None
2 b d None
Then I created dummies on this dataframe, without giving any meaningful prefix:
df = pd.get_dummies(df, prefix="")
df
_a _b _b _d _c
0 1 0 1 0 1
1 1 0 1 0 0
2 0 1 0 1 0
Now, to get the desired result, we can sum up all the duplicate columns (assigning the result back to df):
df = df.groupby(level=0, axis=1).sum()
df
_a _b _c _d
0 1 1 1 0
1 1 1 0 0
2 0 1 0 1
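A note on compatibility: groupby(..., axis=1) is deprecated in recent pandas versions. A version-independent way to sum duplicate column labels is to transpose, group on the index, and transpose back. A minimal sketch with the same toy columns as above:

```python
import pandas as pd

# Same shape as the dummies above: two '_b' columns to be merged.
dummies = pd.DataFrame([[1, 0, 1, 0, 1],
                        [1, 0, 1, 0, 0],
                        [0, 1, 0, 1, 0]],
                       columns=['_a', '_b', '_b', '_d', '_c'])

# Transpose, sum the rows sharing a label, transpose back.
merged = dummies.T.groupby(level=0).sum().T
```

The merged frame has one column per label: _a, _b, _c, _d.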
For Col1, we can create dummy variables directly using pd.get_dummies() and store them in a separate dataframe, say col1_df. We can then concatenate both dataframes using pd.concat([df, col1_df], axis=1, sort=False).
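Putting the steps above together, here is a self-contained sketch of this approach. The dtype='uint8' argument to get_dummies (to keep memory low) and the transpose trick for merging duplicate columns are my additions, not part of the original snippets:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})

# Split the comma-delimited Col2 into one column per token.
split_df = pd.DataFrame(df['Col2'].str.split(',').tolist())

# One-hot encode each split column with a shared 'Col2' prefix.
col2_df = pd.get_dummies(split_df, prefix='Col2', dtype='uint8')

# Sum duplicate column labels (e.g. the two 'Col2_b' columns),
# then cast back down to uint8 since summing may upcast.
col2_df = col2_df.T.groupby(level=0).sum().T.astype('uint8')

# Dummies for Col1, then concatenate both dataframes.
col1_df = pd.get_dummies(df['Col1'], prefix='Col1', dtype='uint8')
result = pd.concat([col1_df, col2_df], axis=1, sort=False)
```

The result matches the desired layout, with Col1_X, Col1_Y first, followed by one column per Col2 token.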
Upvotes: 1
Reputation: 1261
I'm almost positive that you're encountering memory issues because str.get_dummies returns an array full of 1s and 0s of datatype np.int64. This is quite different from the behavior of pd.get_dummies, which returns an array of values of datatype uint8.
This appears to be a known issue. However, there's been no update, nor fix, for the past year. Checking out the source code for str.get_dummies will indeed confirm that it is returning np.int64.
An 8-bit integer takes up 1 byte of memory, while a 64-bit integer takes up 8 bytes. I'm hopeful that the memory problems can be avoided by finding an alternative way to one-hot encode Col2 that ensures the outputs are all 8-bit integers.
Here was my approach, beginning with your example:
df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df
Col1 Col2
0 X a,b,c
1 Y a,b
2 X b,d
Since Col1 contains simple, non-delimited strings, we can easily one-hot encode it using pd.get_dummies:
df = pd.get_dummies(df, columns=['Col1'])
df
Col2 Col1_X Col1_Y
0 a,b,c 1 0
1 a,b 0 1
2 b,d 1 0
So far so good.
df['Col1_X'].values.dtype
dtype('uint8')
Next, we collect the sorted, unique values that appear in Col2:
vals = list(df['Col2'].str.split(',').values)
vals = [i for l in vals for i in l]
vals = list(set(vals))
vals.sort()
vals
['a', 'b', 'c', 'd']
We can then use str.contains to create a new column for each value, such as 'a'. Each row in a new column will contain 1 if that row's Col2 string actually contains the column's value. As we create each new column, we make sure to convert its datatype to uint8:
col = 'Col2'
for v in vals:
    n = col + '_' + v
    df[n] = df[col].str.contains(v)
    df[n] = df[n].astype('uint8')
df.drop(col, axis=1, inplace=True)
df
Col1_X Col1_Y Col2_a Col2_b Col2_c Col2_d
0 1 0 1 1 1 0
1 0 1 1 1 0 0
2 1 0 0 1 0 1
This results in a dataframe that meets your desired format. And thankfully, the integers in the four new columns that were one-hot encoded from Col2 only take up 1 byte each, as opposed to 8 bytes each.
df['Col2_a'].dtype
dtype('uint8')
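One caveat worth noting: str.contains(v) does substring (regex) matching, which is fine for the single-letter categories here, but a value like 'b' would also match inside a longer value like 'ab'. Anchoring the value between delimiters avoids that. A sketch, using a hypothetical overlapping example of my own:

```python
import re
import pandas as pd

df = pd.DataFrame({'Col2': ['ab,c', 'b,c', 'ab']})
vals = sorted({v for row in df['Col2'].str.split(',') for v in row})

for v in vals:
    # Anchor the value to the start/end of the string or a comma,
    # so 'b' does not match inside 'ab'.
    pattern = r'(?:^|,)' + re.escape(v) + r'(?:,|$)'
    df['Col2_' + v] = df['Col2'].str.contains(pattern).astype('uint8')
```

With plain str.contains('b'), the first row would incorrectly get Col2_b = 1; the anchored pattern keeps it at 0.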
If, on the off chance, the above approach doesn't work, my advice would be to use str.get_dummies to one-hot encode Col2 in chunks of rows. Each time you process a chunk, you would convert its datatype from np.int64 to uint8, and then transform the chunk to a sparse matrix. You could eventually concatenate all chunks back together.
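The chunked idea in the last paragraph could look something like the sketch below. The helper name, the chunk_size parameter, and the scipy-based sparse stacking are my assumptions, not part of the original answer:

```python
import pandas as pd
from scipy import sparse

def one_hot_in_chunks(s, chunk_size=100_000):
    """One-hot encode a comma-delimited string Series chunk by chunk,
    downcasting each chunk to uint8 and storing it as a sparse matrix."""
    # Build the full vocabulary first so every chunk gets the same columns.
    vocab = sorted({v for row in s.str.split(',') for v in row})
    chunks = []
    for start in range(0, len(s), chunk_size):
        block = s.iloc[start:start + chunk_size].str.get_dummies(sep=',')
        # Align columns across chunks, then downcast int64 -> uint8.
        block = block.reindex(columns=vocab, fill_value=0).astype('uint8')
        chunks.append(sparse.csr_matrix(block.values))
    return sparse.vstack(chunks), vocab

mat, vocab = one_hot_in_chunks(pd.Series(['a,b,c', 'a,b', 'b,d']), chunk_size=2)
```

Only one small dense chunk exists in memory at a time; the accumulated result stays sparse and uint8 throughout.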
Upvotes: 2