Tim
Tim

Reputation: 2050

Pandas histogram from count of columns

I have a big dataframe that consist of about 6500 columns where one is a classlabel and the rest are boolean values of either 0 or 1, the dataframe is sparse.

example:

df = pd.DataFrame({
            'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
            'x1' : np.random.choice(2, 8),
            'x2' : np.random.choice(2, 8),
            'x3' : np.random.choice(2, 8)})

What I want is a report (preferably in pandas so I can plot it easily) that shows me the sum of unique elements of the columns grouped by the label.

So for example this data frame:

    x1  x2  x3  label
0   0   1   1   a
1   1   0   1   b
2   0   1   0   c
3   1   0   0   b
4   1   1   1   a
5   0   0   1   c
6   1   0   0   b
7   0   1   0   a

The result should be something like this:

a: 3 (since it has x1, x2 and x3)
b: 2 (since it has x1, x3)
c: 2 (since it has x2, x3)

So it's kind of a count of which columns are present in each label. Think of a histogram where the x-axis is the label and the y-axis the number of columns.

Upvotes: 2

Views: 1081

Answers (2)

matt_s
matt_s

Reputation: 1075

You could try pivoting:

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
        'x1' : np.random.choice(2, 8),
        'x2' : np.random.choice(2, 8),
        'x3' : np.random.choice(2, 8)})

pd.pivot_table(df, index='label').transpose().apply(np.count_nonzero)

For df:

label   x1  x2  x3
0   a   0   0   0
1   b   0   1   0
2   c   1   0   1
3   b   0   1   0
4   a   1   1   1
5   c   1   0   1
6   b   0   1   0
7   a   1   1   1

The result is:

label
a    3
b    1
c    2
dtype: int64

Upvotes: 3

G.S
G.S

Reputation: 402

label = df.groupby('label')
for key,val in label.count()['x1'].iteritems():
    strg = '%s:%s' %(key,val)
    for col,vl in label.sum().ix[key].iteritems():
        if vl!=0:
            strg += ' %s'%col
    print strg

Upvotes: 0

Related Questions