Minh

Reputation: 2300

pandas: Combining Multiple Categories into One

Let's say I have categories 1 to 10, and I want to assign red to values 3 to 5, green to 1, 6, and 7, and blue to 2, 8, 9, and 10.

How would I do this? If I try

df.cat.rename_categories(['red','green','blue'])

I get an error: ValueError: new categories need to have the same number of items than the old categories! But if I put this in

df.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
                          'green', 'green', 'blue', 'blue', 'blue'])

I'll get an error saying that there are duplicate values.

The only other method I can think of is to write a for loop that goes through a dictionary of the values and replaces them. Is there a more elegant way of resolving this?

Upvotes: 16

Views: 31478

Answers (8)

scign

Reputation: 942

@Divakar's answer using pandas.Series.explode to create the mapping is nifty, but it stops at creating the reverse of the mapping needed. To expand on that answer, we need to reverse the mapping and then apply it to the series.

import numpy as np
import pandas as pd

# Create a random series of integers as a demo
np.random.seed(0)
df = pd.Series(np.random.randint(1, 11, 6))

# build the mapping
m = {
    "red": [3,4,5],
    "green": [1,6,7],
    "blue": [2,8,9,10]
}

# convert to series, explode the lists and use a dictionary
# comprehension to reverse the mapping
mapper = {k: v for v, k in pd.Series(m).explode().items()}

# run the mapping over the original df
new_df = df.map(mapper).astype('category')

# show the original and the new side by side
df_compare = pd.concat([df, new_df], axis=1)

print(df_compare)

Output:

    0   1
0   6   green
1   1   green
2   4   red
3   4   red
4   8   blue
5   10  blue

Upvotes: 0

utpal dutta

Reputation: 11

(It has been quite some time since the question was asked. I am new to data science, so pardon me if my solution is not up to the mark.)
I think a simpler way is to write a function and then apply it to the series.

def color(num):
    blue = [2,8,9,10]
    green = [1,6,7]
    red  = [3,4,5]
    if num in blue:
        return 'blue'
    if num in green:
        return 'green'
    else:
        return 'red'
# use bracket assignment: df.m2 = ... would set an attribute, not create a column
df['m2'] = df['m1'].apply(color)
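
For anyone who wants to try the function-based approach end to end, here is a minimal sketch. The column names m1 and m2 follow the snippet above; the sample data, the GROUPS dict, and the choice to return None for unknown numbers (rather than falling through to 'red') are my own assumptions.

```python
import pandas as pd

# Hypothetical sample data; 11 is included to show an unmapped value.
df = pd.DataFrame({"m1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})

# Assumed helper structure: new label -> set of old values.
GROUPS = {"blue": {2, 8, 9, 10}, "green": {1, 6, 7}, "red": {3, 4, 5}}

def color(num):
    """Return the label whose group contains num, or None if no group does."""
    for name, members in GROUPS.items():
        if num in members:
            return name
    return None  # unmapped values are left missing instead of becoming 'red'

# Bracket assignment creates the new column.
df["m2"] = df["m1"].apply(color).astype("category")
print(df)
```

The unmapped row ends up as a missing value in m2, which makes gaps in the grouping easy to spot.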

Upvotes: 0

Niels Van Steen

Reputation: 308

I know this is not the exact answer to the question, but I came across this question when searching for mine and thought it might help someone.

The thing is, here you know every value you want to fold into each category. My problem had to do with gender: I wanted Male, Female and Other, but the column contained Male, Female and a dozen other values. How do you give all those remaining values the single category 'Other'?

Note that this is not my answer. I found it here: Conditionally create an "Other" category in categorical column. The answer was posted by user12705352, but I will paste it below.

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

# Create categorical
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].astype(pd.CategoricalDtype(categories=df['NEIGHBORHOOD'].unique(), ordered=False))
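
Applied to the gender case I described, the same idea could look roughly like this. It is only a sketch: the gender column name and the sample values are made up, and I use Series.where instead of the .loc assignment from the quoted answer.

```python
import pandas as pd

# Made-up sample data for the gender example.
df = pd.DataFrame(
    {"gender": ["Male", "Female", "Nonbinary", "Female", "Agender", "Male"]}
)

keep = ["Male", "Female"]

# Series.where keeps values where the condition is True and
# substitutes 'Other' everywhere else.
collapsed = df["gender"].where(df["gender"].isin(keep), "Other")

# Fix the category list explicitly so 'Other' exists even if it never occurs.
df["gender"] = collapsed.astype(pd.CategoricalDtype(categories=keep + ["Other"]))
print(df["gender"].value_counts())
```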

Upvotes: 0

britodfbr

Reputation: 1991

It can be done this way:

import pandas as pd
df = pd.DataFrame(range(1, 11), columns=['colors'])
color2cod = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}
cod2color = {cod: k for k, cods in color2cod.items() for cod in cods }

df['m'] = df.colors.map(cod2color.get)
df.m = df.m.astype('category')
print('---')
print(df.m.cat.categories)
print('---')
print(df.info())

Upvotes: 1

vector07

Reputation: 359

OK, this is slightly simpler, hopefully will stimulate further conversation.

OP's example input:

>>> my_data = {'numbers': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
>>> df = pd.DataFrame(data=my_data)
>>> df.numbers = df.numbers.astype('category')
>>> df.numbers.cat.rename_categories(['green', 'blue', 'red', 'red', 'red',
...                                   'green', 'green', 'blue', 'blue', 'blue'])

This yields ValueError: Categorical categories must be unique, as the OP states.

My solution:

>>> # write out a dict with the mapping of old to new
>>> remap_cat_dict = {
...     1: 'green',
...     2: 'blue',
...     3: 'red',
...     4: 'red',
...     5: 'red',
...     6: 'green',
...     7: 'green',
...     8: 'blue',
...     9: 'blue',
...     10: 'blue'}

>>> df.numbers = df.numbers.map(remap_cat_dict).astype('category')
>>> df.numbers
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: numbers, dtype: category
Categories (3, object): [blue, green, red]

This forces you to write out a complete dict with a 1:1 mapping of old categories to new, but it is very readable. The conversion is then straightforward: Series.map takes each value and substitutes the corresponding entry from remap_cat_dict. Then convert the result to category and overwrite the column.
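
One thing to watch with the dict approach: Series.map returns NaN for any value that has no key in the dict, so an incomplete dict fails silently. A small sketch of how one might check for that before converting (the deliberately incomplete dict and the sample series are made up for illustration):

```python
import pandas as pd

remap_cat_dict = {1: "green", 2: "blue", 3: "red"}  # deliberately incomplete
numbers = pd.Series([1, 2, 3, 4], name="numbers")

mapped = numbers.map(remap_cat_dict)

# Values without a dict entry come back as NaN, so list them first.
missing = numbers[mapped.isna()].unique().tolist()
print(missing)

# Converting anyway keeps the gap as a missing value in the categorical.
colors = mapped.astype("category")
```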

I encountered almost this exact problem: I wanted to create a new column with fewer categories, converted from an old column. That works just as easily here (and has the benefit of not overwriting an existing column):

>>> df['colors'] = df.numbers.map(remap_cat_dict).astype('category')
>>> print(df)
  numbers colors
0       1  green
1       2   blue
2       3    red
3       4    red
4       5    red
5       6  green
6       7  green
7       8   blue
8       9   blue
9      10   blue

>>> df.colors

0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: colors, dtype: category
Categories (3, object): [blue, green, red]

EDIT 5/2/20: Further simplified df.numbers.apply(lambda x: remap_cat_dict[x]) to df.numbers.map(remap_cat_dict) (thanks @JohnE)

Upvotes: 5

Divakar

Reputation: 221584

It seems Series.explode, released with pandas 0.25.0 (July 18, 2019), would fit right in here and avoid any explicit looping:

# Mapping dict
In [150]: m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10]}

In [151]: pd.Series(m).explode().sort_values()
Out[151]: 
green     1
blue      2
red       3
red       4
red       5
green     6
green     7
blue      8
blue      9
blue     10
dtype: object

So the result is a pandas Series that holds every required pair, with the new label in the index and the old value in the values. Depending on requirements, we can use it directly or, if a dict or a Series keyed by the old value is needed, swap index and values. Let's explore those too.

# Mapping obtained
In [152]: s = pd.Series(m).explode().sort_values()

1) Output as dict :

In [153]: dict(zip(s.values, s.index))
Out[153]: 
{1: 'green',
 2: 'blue',
 3: 'red',
 4: 'red',
 5: 'red',
 6: 'green',
 7: 'green',
 8: 'blue',
 9: 'blue',
 10: 'blue'}

2) Output as series :

In [154]: pd.Series(s.index, s.values)
Out[154]: 
1     green
2      blue
3       red
4       red
5       red
6     green
7     green
8      blue
9      blue
10     blue
dtype: object
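
To close the loop, the swapped Series can be passed straight to Series.map. The following is a sketch, not part of the session above: the numbers sample is made up, and the astype("int64") call is an extra step I added so that the lookup index has an integer dtype rather than object.

```python
import pandas as pd

m = {"red": [3, 4, 5], "green": [1, 6, 7], "blue": [2, 8, 9, 10]}

# explode() yields an object-dtype Series; cast so the lookup index is integer.
s = pd.Series(m).explode().astype("int64")

# Swap index and values: old number -> new color.
lookup = pd.Series(s.index, index=s.values)

numbers = pd.Series([6, 1, 4, 4, 8, 10])  # sample data
colors = numbers.map(lookup).astype("category")
print(colors.tolist())
```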

Upvotes: 11

JohnE

Reputation: 30424

I certainly don't see an issue with @DSM's original answer here, but that dictionary comprehension might not be the easiest thing to read for some (although it is a fairly standard approach in Python).

If you don't want to use a dictionary comprehension but are willing to use numpy, then I would suggest np.select, which is roughly as concise as @DSM's answer but perhaps a little more straightforward to read, like @vector07's answer.

import numpy as np 

number = [ df.numbers.isin([3,4,5]), 
           df.numbers.isin([1,6,7]), 
           df.numbers.isin([2,8,9,10]),
           df.numbers.isin([11]) ]

color  = [ "red", "green", "blue", "purple" ]

# a string default avoids mixing the integer default (0) with string choices
df.numbers = np.select( number, color, default='' )

Output (note this is a string or object column, but of course you can easily convert it to a category with astype('category')):

0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue

It's basically the same thing, but you could also do this with np.where:

df['numbers2'] = ''
df.numbers2 = np.where( df.numbers.isin([3,4,5]),    "red",    df.numbers2 ) 
df.numbers2 = np.where( df.numbers.isin([1,6,7]),    "green",  df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([2,8,9,10]), "blue",   df.numbers2 )
df.numbers2 = np.where( df.numbers.isin([11]),       "purple", df.numbers2 )

That's not going to be as efficient as np.select, which is probably the most efficient way to do this (although I didn't time it), but it is arguably more readable in that you can put each key/value pair on the same line.
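
If you want a categorical column directly, one option is to wrap the np.select result in pd.Categorical. This is a rough sketch under my own assumptions: the sample frame, the colors column name, and the 'other' fallback label are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data; 12 matches none of the conditions below.
df = pd.DataFrame({"numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]})

conditions = [
    df["numbers"].isin([3, 4, 5]),
    df["numbers"].isin([1, 6, 7]),
    df["numbers"].isin([2, 8, 9, 10]),
]
labels = ["red", "green", "blue"]

# Rows matching no condition get the string default 'other'.
df["colors"] = pd.Categorical(
    np.select(conditions, labels, default="other"),
    categories=labels + ["other"],
)
print(df)
```

Listing the categories explicitly also fixes their order, which astype('category') would otherwise infer by sorting the observed values.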

Upvotes: 3

DSM

Reputation: 353179

Not sure about elegance, but you can make a dict that maps each new category to its old values and then invert it, something like (note the added 'purple'):

>>> m = {"red": [3,4,5], "green": [1,6,7], "blue": [2,8,9,10], "purple": [11]}
>>> m2 = {v: k for k,vv in m.items() for v in vv}
>>> m2
{1: 'green', 2: 'blue', 3: 'red', 4: 'red', 5: 'red', 6: 'green', 
 7: 'green', 8: 'blue', 9: 'blue', 10: 'blue', 11: 'purple'}

You can use this to build a new categorical Series:

>>> # current pandas takes the category list via a CategoricalDtype
>>> df.cat.map(m2).astype(pd.CategoricalDtype(categories=list(set(m2.values()))))
0    green
1     blue
2      red
3      red
4      red
5    green
6    green
7     blue
8     blue
9     blue
Name: cat, dtype: category
Categories (4, object): [green, purple, red, blue]

You don't need to pass the categories explicitly (or an ordered equivalent, if you care about the categorical ordering) if you're sure that every category will actually appear in the column. But here, if we didn't do that, we wouldn't have seen purple in the resulting Categorical, because it would be built only from the values it actually saw.
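
A small sketch of that difference, using a shortened mapping and a made-up series in which 11 never occurs (the variable names inferred and explicit are only for illustration):

```python
import pandas as pd

m2 = {1: "green", 2: "blue", 3: "red", 11: "purple"}
s = pd.Series([1, 2, 3])  # 11 never occurs in the data

# Categories inferred from the observed values only.
inferred = s.map(m2).astype("category")

# Categories supplied up front, sorted for a deterministic order.
dtype = pd.CategoricalDtype(categories=sorted(set(m2.values())))
explicit = s.map(m2).astype(dtype)

print(inferred.cat.categories.tolist())  # ['blue', 'green', 'red']
print(explicit.cat.categories.tolist())  # ['blue', 'green', 'purple', 'red']
```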

Of course, if you already have your list ['green','blue','red', etc.] built, it's equally easy just to use it to make a new categorical column directly and bypass this mapping entirely.
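
For instance, assuming the labels are already in row order, something along these lines should work (the labels list and the colors name are made up for illustration):

```python
import pandas as pd

# One label per row, already in the right order.
labels = ["green", "blue", "red", "red", "red",
          "green", "green", "blue", "blue", "blue"]

# Build the categorical directly; 'purple' is declared but never observed.
colors = pd.Series(
    pd.Categorical(labels, categories=["red", "green", "blue", "purple"]),
    name="colors",
)
print(colors.cat.categories.tolist())
print(colors.value_counts())
```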

Upvotes: 14
