Reputation: 2050
I have a big dataframe that consist of about 6500 columns where one is a classlabel and the rest are boolean values of either 0 or 1, the dataframe is sparse.
example:
df = pd.DataFrame({
'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
'x1' : np.random.choice(2, 8),
'x2' : np.random.choice(2, 8),
'x3' : np.random.choice(2, 8)})
What I want is a report (preferably in pandas so I can plot it easily) that shows me the sum of unique elements of the columns grouped by the label.
So for example this data frame:
x1 x2 x3 label
0 0 1 1 a
1 1 0 1 b
2 0 1 0 c
3 1 0 0 b
4 1 1 1 a
5 0 0 1 c
6 1 0 0 b
7 0 1 0 a
The result should be something like this:
a: 3 (since it has x1, x2 and x3)
b: 2 (since it has x1, x3)
c: 2 (since it has x2, x3)
So it's kind of a count of which columns are present in each label. Think of a histogram where the x-axis is the label
and the y-axis the number of columns
.
Upvotes: 2
Views: 1081
Reputation: 1075
You could try pivoting:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'label' : ['a', 'b', 'c', 'b','a', 'c', 'b', 'a'],
'x1' : np.random.choice(2, 8),
'x2' : np.random.choice(2, 8),
'x3' : np.random.choice(2, 8)})
pd.pivot_table(df, index='label').transpose().apply(np.count_nonzero)
For df:
label x1 x2 x3
0 a 0 0 0
1 b 0 1 0
2 c 1 0 1
3 b 0 1 0
4 a 1 1 1
5 c 1 0 1
6 b 0 1 0
7 a 1 1 1
The result is:
label
a 3
b 1
c 2
dtype: int64
Upvotes: 3
Reputation: 402
label = df.groupby('label')
for key,val in label.count()['x1'].iteritems():
strg = '%s:%s' %(key,val)
for col,vl in label.sum().ix[key].iteritems():
if vl!=0:
strg += ' %s'%col
print strg
Upvotes: 0