Ali Crash

Reputation: 563

How to draw a proper chart of a distribution tree?

I am using Python with matplotlib and need to visualize the percentage distribution of the sub-groups of a data set.

Imagine this tree:

Data --- group1 (40%)
     -
     --- group2 (25%)
     -
     --- group3 (35%)


group1 --- A (25%)
       -
       --- B (25%)
       -
       --- C (50%)

And it can go on: each group can have several sub-groups, and so can each sub-group.

How can I plot a proper chart for this information?

Upvotes: 3

Views: 418

Answers (2)

AlCorreia

Reputation: 552

I created a minimal reproducible example that I think fits your description, but please let me know if that is not what you need.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)

For instance, we could get the following counts for the subgroups.

In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]:
group  subgroup
1      A           17
       C           16
       B            5
2      A           23
       C           10
       B            7
3      C            8
       A            7
       B            7
Name: subgroup, dtype: int64
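
The same grouping can also report shares instead of raw counts, which is closer to the percentages in the question. The following is a minimal sketch, assuming a seeded generator (`default_rng(0)` is an arbitrary choice, not part of the example above), so its numbers will differ from the counts shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: arbitrary seed, only for repeatability
n_rows = 100
data = pd.DataFrame({
    'group': rng.choice(['1', '2', '3'], n_rows),
    'subgroup': rng.choice(['A', 'B', 'C'], n_rows),
})

# Share of each group in the whole data set
group_share = data['group'].value_counts(normalize=True).sort_index()
# Share of each subgroup within its own group
subgroup_share = data.groupby('group')['subgroup'].value_counts(normalize=True)

print(group_share)
print(subgroup_share)
```

Within every group the subgroup shares add up to 1, which is the relationship the tree in the question describes.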

I created a function that computes the necessary counts given an ordering of the columns (e.g. ['group', 'subgroup']) and incrementally plots the bars with the corresponding percentages.

import matplotlib.pyplot as plt
import matplotlib.cm
import numpy as np

def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data 
    is distributed at different levels. The order of the 
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """

    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        ax.axis('off')

    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel1 is nice but has few colors. Change to a larger map if needed.
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in matplotlib 3.9
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))

    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        # Sum the lowest-level counts within each higher-level group
        counts['c_' + o] = counts.groupby(ordering[:i])['c_' + ordering[-1]].transform('sum')
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]

    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    for bar, col in enumerate(ordering):
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        # Take the labels from the grouped rows so they stay aligned with the
        # counts even if some group/subgroup combination never occurs
        labels = local_counts[col]
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start stacking from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
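    # Build the legend from the labelled bars, keeping one entry per label
    handles, bar_labels = ax.get_legend_handles_labels()
    unique = dict(zip(bar_labels, handles))
    ax.legend(list(unique.values()), list(unique.keys()))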
    return fig

With the data shown above we would get the following.

fig = plot_tree(data, ['group', 'subgroup'], axis=True)

[Figure: tree plot example]
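
To try this on fixed percentages like those in the question rather than on random data, one option is to expand a percentage tree into rows. This is only a sketch: the `tree` dict and the `tree_to_frame` helper are hypothetical names, and the splits for group2 and group3 are made up, because the question gives them only for group1.

```python
import pandas as pd

# Hypothetical encoding: group -> (share of all data, {subgroup: share of group}).
# Only group1's split comes from the question; group2 and group3 are invented.
tree = {
    'group1': (0.40, {'A': 0.25, 'B': 0.25, 'C': 0.50}),
    'group2': (0.25, {'A': 0.20, 'B': 0.40, 'C': 0.40}),
    'group3': (0.35, {'A': 0.20, 'B': 0.20, 'C': 0.60}),
}

def tree_to_frame(tree, n_rows=100):
    """Expand the percentage tree into one row per observation."""
    rows = []
    for group, (group_share, subgroups) in tree.items():
        for subgroup, sub_share in subgroups.items():
            # Number of rows for this leaf, rounded to a whole observation
            n = round(n_rows * group_share * sub_share)
            rows.extend([(group, subgroup)] * n)
    return pd.DataFrame(rows, columns=['group', 'subgroup'])

data = tree_to_frame(tree)
print(data.groupby('group')['subgroup'].value_counts())
```

The resulting frame has the same `group` and `subgroup` columns as the random example, so it should be usable as `plot_tree(data, ['group', 'subgroup'])`. Rounding means the shares are only reproduced exactly when `n_rows` divides evenly.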

Upvotes: 2
