Reputation: 563
I am using python with matplotlib and need to visualize distribution percentage of sub-groups of an data set.
imagine this tree:
Data --- group1 (40%)
-
--- group2 (25%)
-
--- group3 (35%)
group1 --- A (25%)
-
--- B (25%)
-
--- c (50%)
and it can go on, each group can have several sub-groups and same for each sub group.
How can i plot a proper chart for this info?
Upvotes: 3
Views: 418
Reputation: 552
I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
For instance, we could get the following counts for the subgroups.
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]: group subgroup
1 A 17
C 16
B 5
2 A 23
C 10
B 7
3 C 8
A 7
B 7
Name: subgroup, dtype: int64
I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']
) and incrementally plots the bars with the corresponding percentages.
import matplotlib.pyplot as plt
import matplotlib.cm
def plot_tree(data, ordering, axis=False):
"""
Plots a sequence of bar plots reflecting how the data
is distributed at different levels. The order of the
levels is given by the ordering parameter.
Parameters
----------
data: pandas DataFrame
ordering: list
Names of the columns to be plotted.They should be
ordered top down, from the larger to the smaller group.
axis: boolean
Whether to plot the axis.
Returns
-------
fig: matplotlib figure object.
The final tree plot.
"""
# Frame set-up
fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
ax.set_xticklabels(['All'] + ordering, fontsize=18)
if not axis:
plt.axis('off')
counts=[data.shape[0]]
# Get colormap
labels = ['All']
for o in reversed(ordering):
labels.extend(data[o].unique().tolist())
# Pastel is nice but has few colors. Change for a larger map if needed
cmap = matplotlib.cm.get_cmap('Pastel1', len(labels))
colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))
# Group the counts
counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
for i, o in enumerate(ordering[:-1], 1):
if ordering[:i]:
counts['c_' + o]=counts.groupby(ordering[:i]).transform('sum')['c_' + ordering[-1]]
# Calculate percentages
counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
for i, o in enumerate(ordering[1:], 1):
counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]
# Plot first bar - all data
ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
comb = 1 # keeps track of the number of possible combinations at each level
for bar, col in enumerate(ordering):
labels = sorted(data[col].unique())*comb
comb *= len(data[col].unique())
# Get only the relevant counts at this level
local_counts = counts[ordering[:bar+1] +
['c_' + o for o in ordering[:bar+1]] +
['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
sizes = local_counts['c_' + col]
percs = local_counts['p_' + col]
bottom = 0 # start at from 0
for size, perc, label in zip(sizes, percs, labels):
ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
bottom += size # stack the bars
ax.legend(colors)
return fig
With the data shown above we would get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
Upvotes: 2