Reputation: 7167
I have two data sources I can join by a field and want to summarize them in a chart:
The two DataFrames share column A
:
ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
'B': np.random.randint(0, 60, size=ROWS),
'C': np.random.randint(0, 100, size=ROWS)})
df.head()
A B C
0 0 10 11
1 1 7 64
2 2 22 12
3 3 1 67
4 4 34 57
And other
which I joined as such:
other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()
A B C D
0 0 10 11 One
1 1 7 64 Two
2 2 22 12 One
3 3 1 67 Two
4 4 34 57 One
A proper way to get a column chart with the count of:
Grouped by B, binned into 0, 1-10, 11-20, 21-30, 21-40, 41+.
Upvotes: 3
Views: 109
Reputation: 577
Tried a different way of doing it.
df['Bins'] = np.where(df['B'].isin([0]), '0',
np.where(df['B'].isin(range(1,11)), '1-10',
np.where(df['B'].isin(range(11,21)), '11-20',
np.where(df['B'].isin(range(21,31)), '21-30',
np.where(df['B'].isin(range(31,40)), '31-40','41+')
))))
df['Class_type'] = np.where(((df['C']>50) & (df['D']== 'One') ), 'C is GTE50 and D is One',
np.where(((df['C']>50) & (df['D']== 'Two')) , 'C is GTE50 and D is Two',
np.where(((df['C']<50) & (df['D']== 'One') ), 'C is LT50 and D is One',
'C is LT50 and D is Two')
))
df.groupby(['Bins', 'Class_type'])['C'].sum().unstack().plot(kind='bar')
plt.show()
#### Output ####
WARNING: Not sure how optimal the solution is.And also it consumes extra space so space complexity may increase.
Upvotes: 1
Reputation: 294546
Using numpy arrays to count then construct the DataFrame
to plot
labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['GE50', 'LT50'])
u, d = np.unique(df.D.values, return_inverse=True)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B)
ltge = (df.C.values >= 50).astype(int)
shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)
pd.concat({
d_: pd.DataFrame(o, labels, ge_lbl)
for d_, o in zip(u, out)
}, names=['Cx', 'D'], axis=1).plot.bar()
Upvotes: 1
Reputation: 59579
IIUC, this can be dramatically simplified to a single groupby, taking advantage of clip
and np.ceil
to form your groups. A single unstack with 2 levels gives us the B-grouping as our x-axis with bars for each D-C combination:
If you want slightly nicer labels, you can map the groupby
values:
(df.groupby(['D',
df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})
])
.size().unstack([0,1]).plot.bar())
Also it's equivalent to group B on:
pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
Upvotes: 6
Reputation: 7167
I arrived to this solution after days of grinding, going back and forth, but there are many things I consider code smells:
groupby
returns a sort-of pivot table and melt
's purpose is to unpivot data.Cx
, but not for D
? Ultimately they are both categorical data with 2 options. After two days, when I got this first solution I needed a break before trying another branch that treat these two equally.reset_index
, only to set_index
lines later. Having to sort_values
before set_index
summary.unstack().unstack()
reads like a big hack.# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()
# %% Bins
df['B_binned'] = pd.cut(df['B'],
bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
right=False,
labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()
# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)
df.head()
# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})
summary.reset_index(inplace=True)
summary = pd.melt(summary,
id_vars=['B_binned', 'Cx'],
value_vars=['One', 'Two'],
var_name='D',
value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary
# %% Chart
summary.unstack().unstack().plot(kind='bar')
Upvotes: 1