Reputation: 111
I've scoured many, many other SO posts for an answer to this question, but haven't found quite what I'm looking for. Here goes:
Let's say we have a dataframe that looks like this:
In [7]: df.head(5)
Out[7]:
bool_flag group int_flag
0 False bottom 0
1 False mid 1
2 False top 1
3 False top 0
4 False high 1
Where there are five unique groups, two unique boolean values, and two unique integer values. I'd like to create a summary table like this:
bottom low mid high top
bool_flag true 5 32 2 12 4
false 2 42 7 2 10
int_flag 0 1 10 15 3 8
1 10 31 14 0 1
summarizing the unique value counts of each of the non-group
columns, and grouped in columns of group
.
I've gotten close. The following pivot_table
command get me tables that resemble components of what I'd like to have.
In [8]: pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)
Out[8]:
group bottom high low mid top
int_flag
0 15 11 8 13 13
1 12 5 8 9 6
In [9]: pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
Out[9]:
group bottom high low mid top
bool_flag
False 19 14 15 18 16
True 8 2 1 4 3
However, the index
of the resulting table isn't the Multiindex
I'd like to have, and thus makes concatenating that pivot table with the same for the bool_flag
more difficult.
I would hope that there's a way to either use groupby
or pivot_table
to get what I want without generating these sub-tabulations and concatenating them, but so far I haven't been able to find it. Pivoting with multiple index columns results in too fine-grained a table (I don't want the count of (False, 0)
pairs for (bool_flag, int_flag)
values, for example, just the count of each unique value within each group
.)
I also attempted to use groupby('group').agg(f)
, where I defined f
to yield the result of calling value_counts()
on each series. However, agg
isn't compatible with functions that return DataFrame
s.
Any suggestions would be greatly appreciated!
Upvotes: 4
Views: 1327
Reputation: 111
Actually, I don't think that what I'm asking for is possible. Setting the indices of the two pivot tables I showed above to MultiIndex
s by doing the following:
x = pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
y = pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)
def multiindex_from_pivot_result(df):
return pd.MultiIndex.from_tuples([(df.index.name, val) for val in df.index], names=['feature', 'values'])
xx = x.set_index(multiindex_from_pivot_result(x))
yy = y.set_index(multiindex_from_pivot_result(y))
results in a tables that look like this:
group bottom high low mid top
feature values
bool_flag False 19 14 15 18 16
True 8 2 1 4 3
and
group bottom high low mid top
feature values
int_flag 0 15 11 8 13 13
1 12 5 8 9 6
however, concatenating them like so
pd.concat([yy, xx])
yields a table with the values I want, but whose index's second-level is overridden with the first frame's index values.
In [24]: pd.concat([yy, xx])
Out[24]:
group bottom high low mid top
feature values
int_flag 0 15 11 8 13 13
1 12 5 8 9 6
bool_flag 0 19 14 15 18 16
1 8 2 1 4 3
Unfortunately, that leaves me with the choice of resetting that level of the index to a normal column, which doesn't print as nicely.
Hope this helped somebody!
Upvotes: 2