dodo4545
dodo4545

Reputation: 315

Join and count number of values in each subcategory

Lets say a have a data frame that looks like this:

 V290   V311
0   GOOD    TOP QUARTER
1   NK-UNASCERTAIN  MIDDLE HALF
2   AVERAGE TOP QUARTER
3   POOR    NK-UNASCERTAIN
4   POOR    MIDDLE HALF
5   GOOD    MIDDLE HALF
6   POOR    TOP QUARTER
7   AVERAGE MIDDLE HALF
8   POOR    MIDDLE HALF
9   AVERAGE MIDDLE HALF
10  POOR    MIDDLE HALF
11  POOR    MIDDLE HALF
12  AVERAGE MIDDLE HALF
13  AVERAGE TOP QUARTER

And i want to group this data by ['V311'] and see how many GOOD or POOR would be in Every ['V311'] Subcategory. I want to make something like this:

  Top Quarter:GOOD:12
              POOR:30
              Average:15
  Middle half:GOOD:5
              POOR:19
              Average:3

And so on...

Upvotes: 1

Views: 571

Answers (3)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

You can do pivot with unstack i.e

df.pivot_table(index='V290',columns='V311',aggfunc='size',fill_value=0).unstack()

V311            V290          
MIDDLE HALF     AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
                POOR              4
NK-UNASCERTAIN  AVERAGE           0
                GOOD              0
                NK-UNASCERTAIN    0
                POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                NK-UNASCERTAIN    0
                POOR              1
dtype: int64

Also :

df.groupby(['V290','V311']).size().unstack().fillna(0).unstack()

If you want the percentage then you can divide by the sum i.e

ndf = df.pivot_table(index='V290',columns='V311',aggfunc='size',fill_value=0)
percents = (ndf/ndf.sum()*100).unstack()

V311            V290          
MIDDLE HALF     AVERAGE            33.333333
                GOOD               11.111111
                NK-UNASCERTAIN     11.111111
                POOR               44.444444
NK-UNASCERTAIN  AVERAGE             0.000000
                GOOD                0.000000
                NK-UNASCERTAIN      0.000000
                POOR              100.000000
TOP QUARTER     AVERAGE            33.333333
                GOOD               33.333333
                NK-UNASCERTAIN      0.000000
                POOR               33.333333
dtype: float64

Upvotes: 3

jezrael
jezrael

Reputation: 863166

Use dict comprehenion with groupby, value_counts and convert to dict:

d = {k:v.value_counts().to_dict() for k,v in df.groupby('V311')['V290']}
print (d)
{'NK-UNASCERTAIN': {'POOR': 1}, 
'MIDDLE HALF': {'POOR': 4, 'NK-UNASCERTAIN': 1, 'AVERAGE': 3, 'GOOD': 1}, 
'TOP QUARTER': {'POOR': 1, 'AVERAGE': 2, 'GOOD': 1}}

For output as Series:

s = df.groupby('V311')['V290'].value_counts()
print (s)
V311            V290          
MIDDLE HALF     POOR              4
                AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
NK-UNASCERTAIN  POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                POOR              1
Name: V290, dtype: int64

EDIT: If need relative frequencies:

s = df.groupby('V311')['V290'].value_counts(normalize=True)
print (s)
V311            V290          
MIDDLE HALF     POOR              0.444444
                AVERAGE           0.333333
                GOOD              0.111111
                NK-UNASCERTAIN    0.111111
NK-UNASCERTAIN  POOR              1.000000
TOP QUARTER     AVERAGE           0.500000
                GOOD              0.250000
                POOR              0.250000
Name: V290, dtype: float64

EDIT1:

If want all missing categories too:

s = df.groupby('V311')['V290'].value_counts()
s = s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=0)
print (s)
MIDDLE HALF     AVERAGE           3
                GOOD              1
                NK-UNASCERTAIN    1
                POOR              4
NK-UNASCERTAIN  AVERAGE           0
                GOOD              0
                NK-UNASCERTAIN    0
                POOR              1
TOP QUARTER     AVERAGE           2
                GOOD              1
                NK-UNASCERTAIN    0
                POOR              1
Name: V290, dtype: int64

Upvotes: 3

Rooler
Rooler

Reputation: 29

Using only pandas:

import pandas as pd
dataframe = pd.DataFrame()
dataframe['V311'] = ['MIDDLE','TOP','MIDDLE','TOP','MIDDLE','TOP','TOP']
print(dataframe['V311'].value_counts())

output:

TOP       4
MIDDLE    3
Name: V311, dtype: int64

Upvotes: 1

Related Questions