Tickets2Moontown

Reputation: 137

% of Unique Values by Group

I want to create the Stories_% column below, which lists each unique Stories value together with its percentage share within each SubID group.

    SubID   Stories   Stories_%
0   102F2   2         1 (20%), 2 (80%)
1   102F2   2         1 (20%), 2 (80%)
2   102F2   2         1 (20%), 2 (80%)
3   102F2   2         1 (20%), 2 (80%)
4   102F2   1         1 (20%), 2 (80%)
5   132F2   2         1 (60%), 2 (40%)
6   132F2   1         1 (60%), 2 (40%)
7   132F2   1         1 (60%), 2 (40%)
8   132F2   2         1 (60%), 2 (40%)
9   132F2   1         1 (60%), 2 (40%)

I would then drop the duplicates and the Stories column, so the final table would look like this:

    SubID   Stories_%
0   102F2   1 (20%), 2 (80%)
1   132F2   1 (60%), 2 (40%)

I started by creating a list with all of the unique values (below):

unique_stories = df.groupby(["SubID"])['Stories'].unique()
unique_stories
SubID
001C5        ['2']
001C6        ['2']
002I2        ['2']
004C6        ['2']
005L2        ['2']
           ...    
709E1    ['1' '2']
725E1    ['2' '1']
730E1    ['2' '1']

The issue I'm having is getting the %s for each unique count.

I've tried...

df.groupby(['SubID'])['Stories'].count()

...but this only creates a total count rather than a count of each unique value within that category.

I've researched a number of posts, but I can only find total counts, nunique counts, and counts of unique values over the entire column, not by group.

Is there a method for doing this?

Upvotes: 1

Views: 44

Answers (2)

Shubham Sharma

Reputation: 71689

Method 1: You can use groupby + value_counts:

s = df.groupby('SubID')['Stories']\
      .value_counts(normalize=True, sort=False).map('{:.0%}'.format)

(s.index.get_level_values(1).astype(str) + ' (' + s + ')')\
   .groupby(level=0).agg(','.join).reset_index(name='Stories %')

Method 2: Or you can use Counter from collections:

from collections import Counter

def normalize():
    d = {}
    for i, g in df.groupby('SubID'):
        c = Counter(g['Stories'])
        d[i] = ', '.join(f'{k} ({v / sum(c.values()):.0%})'
                         for k, v in c.items())
    return d

pd.Series(normalize(), name='Stories %').rename_axis('SubID').reset_index()

   SubID        Stories %
0  102F2  1 (20%),2 (80%)
1  132F2  1 (60%),2 (40%)
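A self-contained sketch of Method 1, with the sample frame reconstructed from the question (assumption: only the two SubIDs shown). The MultiIndex is flattened with `reset_index` before building the "value (pct)" strings, which avoids index-alignment subtleties:

```python
import pandas as pd

# Sample frame reconstructed from the question (assumption: only these two SubIDs)
df = pd.DataFrame({
    'SubID': ['102F2'] * 5 + ['132F2'] * 5,
    'Stories': [2, 2, 2, 2, 1, 2, 1, 1, 2, 1],
})

# Fraction of each unique Stories value within each SubID, formatted as a percent
s = (df.groupby('SubID')['Stories']
       .value_counts(normalize=True, sort=False)
       .map('{:.0%}'.format))

# Flatten the MultiIndex and build one "value (pct)" string per SubID
t = s.reset_index(name='pct')
t['fmt'] = t['Stories'].astype(str) + ' (' + t['pct'] + ')'
out = t.groupby('SubID')['fmt'].agg(', '.join).reset_index(name='Stories %')
print(out)
```

From here, `df.merge(out, on='SubID')` would reproduce the per-row Stories_% column from the question's first table.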

Upvotes: 2

ALollz

Reputation: 59519

If you're okay with a slightly different format that stores the same information but leverages a MultiIndex (which will make future manipulations and calculations simpler), then we can get this with groupby + size and then a per-group sum.

s = df.groupby(['SubID', 'Stories']).size()
s = s / s.groupby(level=0).transform('sum')  # Divide by the total number within each `'SubID'`

SubID  Stories
102F2  1          0.2
       2          0.8
132F2  1          0.6
       2          0.4
dtype: float64
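A runnable sketch of this approach, with the sample frame reconstructed from the question (assumption: only the two SubIDs shown). Note that `Series.sum(level=...)` has been removed in recent pandas, so the per-group total is taken with a groupby transform:

```python
import pandas as pd

# Sample frame reconstructed from the question (assumption: only these two SubIDs)
df = pd.DataFrame({
    'SubID': ['102F2'] * 5 + ['132F2'] * 5,
    'Stories': [2, 2, 2, 2, 1, 2, 1, 1, 2, 1],
})

# Count each (SubID, Stories) pair, then divide by the per-SubID total
s = df.groupby(['SubID', 'Stories']).size()
s = s / s.groupby(level='SubID').transform('sum')
print(s)
```

Individual shares can then be looked up directly from the MultiIndex, e.g. `s.loc[('102F2', 2)]`.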

Upvotes: 1
