silver
silver

Reputation: 111

How to create summary table of unique value counts?

I've scoured many, many other SO posts for an answer to this question, but haven't found quite what I'm looking for. Here goes:

Let's say we have a dataframe that looks like this:

In [7]: df.head(5)
Out[7]:
  bool_flag   group  int_flag
0     False  bottom         0
1     False     mid         1
2     False     top         1
3     False     top         0
4     False    high         1

Where there are five unique groups, two unique boolean values, and two unique integer values. I'd like to create a summary table like this:

                 bottom   low  mid  high  top
bool_flag  true       5     32   2    12    4
          false       2     42   7     2   10
int_flag      0       1     10  15     3    8 
              1      10     31  14     0    1

summarizing the unique value counts of each of the non-group columns, and grouped in columns of group.

I've gotten close. The following pivot_table command get me tables that resemble components of what I'd like to have.

In [8]: pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)
Out[8]:
group     bottom  high  low  mid  top
int_flag
0             15    11    8   13   13
1             12     5    8    9    6


In [9]: pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
Out[9]:
group      bottom  high  low  mid  top
bool_flag
False          19    14   15   18   16
True            8     2    1    4    3

However, the index of the resulting table isn't the Multiindex I'd like to have, and thus makes concatenating that pivot table with the same for the bool_flag more difficult.

I would hope that there's a way to either use groupby or pivot_table to get what I want without generating these sub-tabulations and concatenating them, but so far I haven't been able to find it. Pivoting with multiple index columns results in too fine-grained a table (I don't want the count of (False, 0) pairs for (bool_flag, int_flag) values, for example, just the count of each unique value within each group.)

I also attempted to use groupby('group').agg(f), where I defined f to yield the result of calling value_counts() on each series. However, agg isn't compatible with functions that return DataFrames.

Any suggestions would be greatly appreciated!

Upvotes: 4

Views: 1327

Answers (1)

silver
silver

Reputation: 111

Actually, I don't think that what I'm asking for is possible. Setting the indices of the two pivot tables I showed above to MultiIndexs by doing the following:

x = pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
y = pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)

def multiindex_from_pivot_result(df):
    return pd.MultiIndex.from_tuples([(df.index.name, val) for val in df.index], names=['feature', 'values'])

xx = x.set_index(multiindex_from_pivot_result(x))
yy = y.set_index(multiindex_from_pivot_result(y))

results in a tables that look like this:

group             bottom  high  low  mid  top
feature   values
bool_flag False       19    14   15   18   16
          True         8     2    1    4    3

and

group            bottom  high  low  mid  top
feature  values
int_flag 0           15    11    8   13   13
         1           12     5    8    9    6

however, concatenating them like so

pd.concat([yy, xx])

yields a table with the values I want, but whose index's second-level is overridden with the first frame's index values.

In [24]: pd.concat([yy, xx])
Out[24]:
group             bottom  high  low  mid  top
feature   values
int_flag  0           15    11    8   13   13
          1           12     5    8    9    6
bool_flag 0           19    14   15   18   16
          1            8     2    1    4    3

Unfortunately, that leaves me with the choice of resetting that level of the index to a normal column, which doesn't print as nicely.

Hope this helped somebody!

Upvotes: 2

Related Questions