Tarun

Reputation: 182

Convert categorical data into dummy set

I have data like this:

| Col1 | Col2  |
|------|-------|
| X    | a,b,c |
| Y    | a,b   |
| X    | b,d   |

I want to convert this categorical data into dummy variables. Since my data is large, get_dummies() from pandas gives a MemoryError. I want my result to look like this:

| Col_X | Col_Y | Col2_a | Col2_b | Col2_c | Col2_d |
|-------|-------|--------|--------|--------|--------|
| 1     | 0     | 1      | 1      | 1      | 0      |
| 0     | 1     | 1      | 1      | 0      | 0      |
| 1     | 0     | 0      | 1      | 0      | 1      |

I have tried converting Col2 this way, but I get a MemoryError because the data is large and there is a lot of variability in Col2 as well.

So:

1) How can I convert multiple categorical columns into dummy variables?

2) pandas get_dummies() is giving a memory error, so how can I handle that?

Upvotes: 1

Views: 3126

Answers (2)

Tarun

Reputation: 182

I would like to give my solution as well, and to thank @James-dellinger for his answer. Here is my approach.

import pandas as pd

df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df

  Col1  Col2
0   X   a,b,c
1   Y   a,b
2   X   b,d

First, I split the Col2 values and expand them into separate columns.

# Split each Col2 string on commas (at most 3 splits) and expand into columns.
df = pd.DataFrame(df['Col2'].str.split(',', 3).tolist(),
                  columns=['Col1', 'Col2', 'Col3'])

df

   Col1 Col2 Col3
0   a   b    c
1   a   b    None
2   b   d    None

Then I created dummy variables from this dataframe without giving any prefix.

df=pd.get_dummies(df, prefix="")

df

    _a  _b  _b  _d  _c
0   1   0   1   0   1
1   1   0   1   0   0
2   0   1   0   1   0

Now, to get the desired result, we can sum up all the duplicate columns. Note that the result has to be assigned back:

df = df.groupby(level=0, axis=1).sum()

df

    _a  _b  _c  _d
0   1   1   1   0
1   1   1   0   0
2   0   1   0   1

For Col1 we can directly create dummy variables using pd.get_dummies() and store them in a separate dataframe, say col1_df. We can then concatenate both with pd.concat([df, col1_df], axis=1, sort=False), as in the sketch below.
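Putting it all together, here is a minimal end-to-end sketch of this approach (the intermediate column names s1/s2/s3 are arbitrary placeholders, not part of the data):

import pandas as pd

df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})

# Dummies for the plain categorical column.
col1_df = pd.get_dummies(df['Col1'], prefix='Col1')

# Split Col2, create unprefixed dummies, and collapse duplicate columns.
col2_df = pd.DataFrame(df['Col2'].str.split(',', 3).tolist(),
                       columns=['s1', 's2', 's3'])
col2_df = pd.get_dummies(col2_df, prefix='')
col2_df = col2_df.groupby(level=0, axis=1).sum()

result = pd.concat([col1_df, col2_df], axis=1, sort=False)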

Upvotes: 1

James Dellinger

Reputation: 1261

I'm almost positive that you're encountering memory issues because str.get_dummies returns an array full of 1s and 0s, of datatype np.int64. This is quite different from the behavior of pd.get_dummies, which returns an array of values of datatype uint8.

This appears to be a known issue. However, there has been no update or fix for the past year. Checking out the source code for str.get_dummies will indeed confirm that it returns np.int64.
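You can also confirm the dtype empirically, without reading the source, with a quick check on a toy Series:

import pandas as pd

s = pd.Series(['a,b,c', 'a,b', 'b,d'])
s.str.get_dummies(',').dtypes
# a    int64
# b    int64
# c    int64
# d    int64
# dtype: object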

An 8-bit integer takes up 1 byte of memory, while a 64-bit integer takes up 8 bytes. I'm hopeful that the memory problems can be avoided by finding an alternative way to one-hot encode Col2 that ensures the outputs are all 8-bit integers.
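To put rough numbers on that difference (the matrix shape here is hypothetical, purely for illustration):

import numpy as np

# Hypothetical dummy matrix: 1 million rows x 500 one-hot columns.
rows, cols = 1_000_000, 500

print(np.dtype(np.int64).itemsize * rows * cols / 1e9)  # 4.0 GB
print(np.dtype(np.uint8).itemsize * rows * cols / 1e9)  # 0.5 GB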

Here was my approach, beginning with your example:

import pandas as pd

df = pd.DataFrame({'Col1': ['X', 'Y', 'X'],
                   'Col2': ['a,b,c', 'a,b', 'b,d']})
df

    Col1    Col2
0   X       a,b,c
1   Y       a,b
2   X       b,d
1. Since Col1 contains simple, non-delimited strings, we can easily one-hot encode it using pd.get_dummies:

df = pd.get_dummies(df, columns=['Col1'])
df

    Col2    Col1_X  Col1_Y
0   a,b,c        1       0
1   a,b          0       1
2   b,d          1       0

So far so good.

df['Col1_X'].values.dtype
dtype('uint8')
2. Let's get a list of all unique substrings contained inside the comma-delimited strings in Col2:

vals = list(df['Col2'].str.split(',').values)
vals = [i for l in vals for i in l]
vals = list(set(vals))
vals.sort()
vals

['a', 'b', 'c', 'd']
3. Now we can loop through the above list of values and use str.contains to create a new column for each value, such as 'a'. Each row in a new column will contain 1 if that row actually has the new column's value, such as 'a', inside its string in Col2. As we create each new column, we make sure to convert its datatype to uint8:

col = 'Col2'
for v in vals:
    n = col + '_' + v                 # new column name, e.g. 'Col2_a'
    df[n] = df[col].str.contains(v)   # True where the row's Col2 string contains v
    df[n] = df[n].astype('uint8')     # cast bool -> uint8, 1 byte per value

df.drop(col, axis=1, inplace=True)
df

    Col1_X  Col1_Y  Col2_a  Col2_b  Col2_c  Col2_d
0        1       0       1       1       1       0
1        0       1       1       1       0       0
2        1       0       0       1       0       1

This results in a dataframe that meets your desired format. And thankfully, the integers in the four new columns that were one-hot encoded from Col2 only take up 1 byte each, as opposed to 8 bytes each.

df['Col2_a'].dtype
dtype('uint8')

If, on the off chance, the above approach doesn't work, my advice would be to use str.get_dummies to one-hot encode Col2 in chunks of rows. As you process each chunk, convert its datatype from np.int64 to uint8, and then transform the chunk into a sparse matrix. You can eventually concatenate all the chunks back together; a sketch follows.
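Here is a minimal sketch of that fallback, assuming scipy is available. The chunk size, the helper name encode_in_chunks, and the use of scipy.sparse.vstack are my own illustrative choices, not part of the original answer:

import numpy as np
import pandas as pd
from scipy import sparse

def encode_in_chunks(s, chunk_size=100_000):
    # One-hot encode a comma-delimited string Series chunk by chunk,
    # keeping each chunk as a sparse uint8 matrix.
    chunks, columns = [], pd.Index([])
    for start in range(0, len(s), chunk_size):
        chunk = s.iloc[start:start + chunk_size].str.get_dummies(',')
        columns = columns.union(chunk.columns)  # values can differ per chunk
        chunks.append(chunk)
    # Align every chunk to the full column set, cast to uint8, sparsify.
    mats = [sparse.csr_matrix(
                c.reindex(columns=columns, fill_value=0).to_numpy(dtype=np.uint8))
            for c in chunks]
    return sparse.vstack(mats), list(columns)

X, col_names = encode_in_chunks(df['Col2'])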

Upvotes: 2
