Reputation: 137
I want to create the Stories_% column shown below,
which lists each unique Stories value and its percentage within each SubID group.
SubID Stories Stories_%
0 102F2 2 1 (20%), 2 (80%)
1 102F2 2 1 (20%), 2 (80%)
2 102F2 2 1 (20%), 2 (80%)
3 102F2 2 1 (20%), 2 (80%)
4 102F2 1 1 (20%), 2 (80%)
5 132F2 2 1 (60%), 2 (40%)
6 132F2 1 1 (60%), 2 (40%)
7 132F2 1 1 (60%), 2 (40%)
8 132F2 2 1 (60%), 2 (40%)
9 132F2 1 1 (60%), 2 (40%)
I would then drop the duplicates and the Stories column,
so the final table would look like this:
SubID Stories_%
0 102F2 1 (20%), 2 (80%)
1 132F2 1 (60%), 2 (40%)
I started by creating a list with all of the unique values (below):
unique_stories = df.groupby(["SubID"])['Stories'].unique().astype(int)
unique_stories
SubID
001C5 ['2']
001C6 ['2']
002I2 ['2']
004C6 ['2']
005L2 ['2']
...
709E1 ['1' '2']
725E1 ['2' '1']
730E1 ['2' '1']
The issue I'm having is getting the percentage for each unique value.
I've tried...
df.groupby(['SubID'])['Stories'].count()
...but this only gives a total count per group rather than a count of each unique value within the group.
I've researched a number of posts but can only find total counts, nunique counts, and counts of unique values over the entire column, not per group.
Is there a method for doing this?
Upvotes: 1
Views: 44
Reputation: 71689
Method 1: You can use groupby + value_counts:
s = df.groupby('SubID')['Stories'] \
      .value_counts(normalize=True, sort=False).map('{:.0%}'.format)

(s.index.get_level_values(1).astype(str) + ' (' + s + ')') \
    .groupby(level=0).agg(','.join).reset_index(name='Stories %')
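As a quick check, Method 1 can be run end-to-end on a small frame reconstructed here from the question's sample data:

```python
import pandas as pd

# Sample frame reconstructed from the question's data (for illustration)
df = pd.DataFrame({
    'SubID':   ['102F2'] * 5 + ['132F2'] * 5,
    'Stories': [2, 2, 2, 2, 1, 2, 1, 1, 2, 1],
})

# Share of each unique Stories value within each SubID, as a percent string
s = df.groupby('SubID')['Stories'] \
      .value_counts(normalize=True, sort=False).map('{:.0%}'.format)

# Build "value (share)" pairs and join them into one string per SubID
out = (s.index.get_level_values(1).astype(str) + ' (' + s + ')') \
    .groupby(level=0).agg(', '.join).reset_index(name='Stories %')
print(out)
```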
Method 2: Or you can use Counter from collections:
from collections import Counter

def normalize():
    d = {}
    for i, g in df.groupby('SubID'):
        c = Counter(g['Stories'])
        d[i] = ', '.join(f'{k} ({v / sum(c.values()):.0%})'
                         for k, v in c.items())
    return d

pd.Series(normalize(), name='Stories %').rename_axis('SubID').reset_index()
SubID Stories %
0 102F2 1 (20%),2 (80%)
1 132F2 1 (60%),2 (40%)
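For reference, here is the Counter approach as a self-contained script; the frame is reconstructed from the question's sample, and `sorted` is added so the pairs always print in value order:

```python
from collections import Counter
import pandas as pd

# Sample frame reconstructed from the question's data (for illustration)
df = pd.DataFrame({
    'SubID':   ['102F2'] * 5 + ['132F2'] * 5,
    'Stories': [2, 2, 2, 2, 1, 2, 1, 1, 2, 1],
})

def normalize(frame):
    """Map each SubID to a 'value (share)' summary string."""
    d = {}
    for sub_id, g in frame.groupby('SubID'):
        c = Counter(g['Stories'])
        total = sum(c.values())
        # sorted() added here for a stable value order in the output
        d[sub_id] = ', '.join(f'{k} ({v / total:.0%})'
                              for k, v in sorted(c.items()))
    return d

out = pd.Series(normalize(df), name='Stories %') \
        .rename_axis('SubID').reset_index()
print(out)
```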
Upvotes: 2
Reputation: 59519
If you're okay with a slightly different format that stores the same information but leverages a MultiIndex (which will make future manipulations and calculations simpler), then we can get this with groupby + size and then sum.
s = df.groupby(['SubID', 'Stories']).size()
s = s/s.sum(level=0) # Divide by the total number within each `'SubID'`
SubID Stories
102F2 1 0.2
2 0.8
132F2 1 0.6
2 0.4
dtype: float64
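If you later need the question's string format, the MultiIndex series collapses into it with a short chain. This sketch uses `groupby(level=0).transform('sum')` in place of `s.sum(level=0)`, since `Series.sum(level=...)` was deprecated and removed in newer pandas:

```python
import pandas as pd

# Sample frame reconstructed from the question's data (for illustration)
df = pd.DataFrame({
    'SubID':   ['102F2'] * 5 + ['132F2'] * 5,
    'Stories': [2, 2, 2, 2, 1, 2, 1, 1, 2, 1],
})

# Per-(SubID, Stories) counts, divided by each SubID's total
s = df.groupby(['SubID', 'Stories']).size()
s = s / s.groupby(level=0).transform('sum')

# Collapse the MultiIndex into the question's "value (share)" string format
out = (s.map('{:.0%}'.format)
        .reset_index(name='pct')
        .assign(pair=lambda d: d['Stories'].astype(str) + ' (' + d['pct'] + ')')
        .groupby('SubID')['pair'].agg(', '.join)
        .reset_index(name='Stories %'))
print(out)
```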
Upvotes: 1