Federico Dorato

Reputation: 784

Finding which values are being grouped together in a Pandas Dataframe

I have the following function, which takes a DataFrame and a second parameter named "ratio":

def grouper(df, ratio):
    if ratio > 0:
        return df.apply(lambda x: x.mask(x.map(x.value_counts()) < len(df) * ratio, 'other'))
    return df

This function groups together the values that appear less frequently: in each column, any value whose count is below len(df) * ratio is replaced with 'other'.

If my Dataframe were to be something like

>>> df

   Country   Manager
0    Italy     Pippo
1   France     Pluto
2  Germany     Pippo
3    Italy     Pluto
4   France     Pippo
5    Spain     Pluto
6    Italy  Paperino
7   France  Topolino
8   Norway    Minnie

Then using the above-mentioned function I would have:

>>> grouper(df, 0.2)

  Country Manager
0   Italy   Pippo
1  France   Pluto
2   other   Pippo
3   Italy   Pluto
4  France   Pippo
5   other   Pluto
6   Italy   other
7  France   other
8   other   other
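To make the threshold explicit, here is a minimal sketch of what happens to a single column. The sample frame is rebuilt from the table above; the intermediate names (threshold, counts, is_rare, grouped_country) are only for illustration and are not part of the original function.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "Country": ["Italy", "France", "Germany", "Italy", "France",
                "Spain", "Italy", "France", "Norway"],
    "Manager": ["Pippo", "Pluto", "Pippo", "Pluto", "Pippo",
                "Pluto", "Paperino", "Topolino", "Minnie"],
})

ratio = 0.2
threshold = len(df) * ratio               # 9 rows * 0.2 = 1.8 occurrences

# How often each country appears in the column.
counts = df["Country"].value_counts()

# Map each row to the count of its own value, then flag the rare ones.
is_rare = df["Country"].map(counts) < threshold

# Replace the flagged rows with 'other', as grouper does column by column.
grouped_country = df["Country"].mask(is_rare, "other")
print(grouped_country.tolist())
```

With 9 rows and a ratio of 0.2, only the values that appear exactly once fall below the threshold.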

Now, I want to find a way to mark down which values have been changed. My desired output is something like this:

{
    "Country" : ["Germany", "Spain", "Norway"],
    "Manager" : ["Paperino", "Topolino", "Minnie"]
}

How can I obtain this?

Upvotes: 0

Views: 40

Answers (2)

jezrael

Reputation: 863226

Use a dictionary comprehension that filters each column:

def grouper(df, ratio):
    if ratio > 0:
        d={x:df.loc[df[x].map(df[x].value_counts()) < len(df) * ratio, x].unique().tolist() 
              for x in df.columns}
        return d
    return df

result = grouper(df, 0.2)
print(result)
{'Country': ['Germany', 'Spain', 'Norway'], 'Manager': ['Paperino', 'Topolino', 'Minnie']}
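If both the grouped frame and the record of replaced values are needed, the same boolean mask can serve both purposes. The following is only a sketch under that assumption; the name grouper_with_report and its return shape are hypothetical, not something from the question.

```python
import pandas as pd

def grouper_with_report(df, ratio):
    """Hypothetical helper: return (grouped frame, {column: replaced values})."""
    grouped = df.copy()
    replaced = {}
    if ratio <= 0:
        # Nothing is grouped, so nothing is reported as replaced.
        return grouped, replaced
    threshold = len(df) * ratio
    for col in df.columns:
        # True for rows whose value occurs fewer than `threshold` times.
        is_rare = df[col].map(df[col].value_counts()) < threshold
        # unique() keeps the order of first appearance in the column.
        replaced[col] = df.loc[is_rare, col].unique().tolist()
        grouped[col] = df[col].mask(is_rare, "other")
    return grouped, replaced

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Country": ["Italy", "France", "Germany", "Italy", "France",
                "Spain", "Italy", "France", "Norway"],
    "Manager": ["Pippo", "Pluto", "Pippo", "Pluto", "Pippo",
                "Pluto", "Paperino", "Topolino", "Minnie"],
})

grouped, replaced = grouper_with_report(df, 0.2)
print(replaced)
```

Returning a tuple keeps the original frame untouched and avoids computing value_counts twice per column.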

Upvotes: 1

Federico Dorato

Reputation: 784

I managed to do it in the most bloody way possible:

def grouper_cat(df, grouping):
    dictionaries = df.apply(
        lambda x: (
            lambda y=x.value_counts() : (
                lambda z =y[y<len(df)*grouping] : {z.name:(z).index.tolist()}
                )()
            )()
        ).values
    result = {}
    for d in dictionaries:
        result.update(d)
    return result

Example:

>>> grouper_cat(df, 0.2)

{'Country': ['Norway', 'Germany', 'Spain'],
 'Manager': ['Topolino', 'Paperino', 'Minnie']}
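The nested lambdas above only bind intermediate results, so the same value_counts idea can probably be written as a plain loop. This is a sketch, not a drop-in replacement: the name grouper_cat_flat is made up for illustration. It keys the result on the column label instead of z.name, because in pandas 2.0 and later the Series returned by value_counts is named 'count' rather than after the column.

```python
import pandas as pd

def grouper_cat_flat(df, grouping):
    """Hypothetical flat rewrite: {column: values rarer than the threshold}."""
    threshold = len(df) * grouping
    result = {}
    for col in df.columns:
        counts = df[col].value_counts()
        # Keep only the values whose count is below the threshold.
        result[col] = counts[counts < threshold].index.tolist()
    return result

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Country": ["Italy", "France", "Germany", "Italy", "France",
                "Spain", "Italy", "France", "Norway"],
    "Manager": ["Pippo", "Pluto", "Pippo", "Pluto", "Pippo",
                "Pluto", "Paperino", "Topolino", "Minnie"],
})

print(grouper_cat_flat(df, 0.2))
```

The order of the values inside each list follows value_counts, so tied values may not come out in the order they appear in the frame.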

Note:

Compared to @jezrael's answer (the new, edited one, timed below as grouper_cat_jez), my solution is apparently faster:

>>> timeit(lambda : grouper_cat(df, 0.2), number=2500)
6.257032366998828

>>> timeit(lambda : grouper_cat_jez(df, 0.2), number=2500)
8.312444757999401

Upvotes: 0
