Reputation: 7167

Pandas: Help transforming data and writing better code

I have two data sources I can join by a field and want to summarize them in a chart:

Data

The two DataFrames share column A:

ROWS = 1000
df = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                             'B': np.random.randint(0, 60, size=ROWS),
                             'C': np.random.randint(0, 100, size=ROWS)})
df.head()

   A   B   C
0  0  10  11
1  1   7  64
2  2  22  12
3  3   1  67
4  4  34  57

And other which I joined as such:

other = pd.DataFrame.from_dict({'A': np.arange(ROWS),
                                'D': np.random.choice(['One', 'Two'], ROWS)})
other.set_index('A', inplace=True)
df = df.join(other, on=['A'], rsuffix='_right')
df.head()

   A   B   C    D
0  0  10  11  One
1  1   7  64  Two
2  2  22  12  One
3  3   1  67  Two
4  4  34  57  One

Question

A proper way to get a column chart with the count of:

C is GTE50 and D is One
C is GTE50 and D is Two
C is LT50 and D is One
C is LT50 and D is Two

Grouped by B, binned into 0, 1-10, 11-20, 21-30, 21-40, 41+.

Upvotes: 3

Answers (4)

Preetham

Reputation: 577

Tried a different way of doing it.

df['Bins'] = np.where(df['B'].isin([0]), '0',
            np.where(df['B'].isin(range(1,11)), '1-10',
            np.where(df['B'].isin(range(11,21)), '11-20',
            np.where(df['B'].isin(range(21,31)), '21-30',
            np.where(df['B'].isin(range(31,40)), '31-40','41+')
            ))))

df['Class_type'] = np.where(((df['C']>50) & (df['D']== 'One') ), 'C is GTE50 and D is One',
            np.where(((df['C']>50) & (df['D']== 'Two')) , 'C is GTE50 and D is Two',
            np.where(((df['C']<50) & (df['D']== 'One') ), 'C is LT50 and D is One',
                     'C is LT50 and D is Two')
            ))


df.groupby(['Bins', 'Class_type'])['C'].sum().unstack().plot(kind='bar')
plt.show()

#### Output ####

WARNING: Not sure how optimal the solution is.And also it consumes extra space so space complexity may increase.

Upvotes: 1

piRSquared

Reputation: 294546

Numpy

Using numpy arrays to count then construct the DataFrame to plot

labels = np.array(['0', '1-10', '11-20', '21-30', '31-40', '41+'])
ge_lbl = np.array(['GE50', 'LT50'])

u, d = np.unique(df.D.values, return_inverse=True)
bins = np.array([1, 11, 21, 31, 41]).searchsorted(df.B)
ltge = (df.C.values >= 50).astype(int)

shape = (len(u), len(labels), len(ge_lbl))
out = np.zeros(shape, int)
np.add.at(out, (d, bins, ltge), 1)

pd.concat({
    d_: pd.DataFrame(o, labels, ge_lbl)
    for d_, o in zip(u, out)
}, names=['Cx', 'D'], axis=1).plot.bar()

Upvotes: 1

ALollz

Reputation: 59579

IIUC, this can be dramatically simplified to a single groupby, taking advantage of clip and np.ceil to form your groups. A single unstack with 2 levels gives us the B-grouping as our x-axis with bars for each D-C combination:

If you want slightly nicer labels, you can map the groupby values:

(df.groupby(['D', 
             df.C.ge(50).map({True: 'GE50', False: 'LT50'}),
             np.ceil(df.B.clip(lower=0, upper=41)/10).map({0: '0', 1: '1-10', 2: '11-20', 3: '21-30', 4: '31-40', 5: '41+'})
            ])
     .size().unstack([0,1]).plot.bar())

Also it's equivalent to group B on:

pd.cut(df['B'],
       bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
       right=False,
       labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])

Upvotes: 6

Leonel Galán

Reputation: 7167

I arrived to this solution after days of grinding, going back and forth, but there are many things I consider code smells:

groupby returns a sort-of pivot table and melt's purpose is to unpivot data.
The use of dummies for Cx, but not for D? Ultimately they are both categorical data with 2 options. After two days, when I got this first solution I needed a break before trying another branch that treat these two equally.
reset_index, only to set_index lines later. Having to sort_values before set_index
That last summary.unstack().unstack() reads like a big hack.

# %% Cx
df['Cx'] = df['C'].apply(lambda x: 'LT50' if x < 50 else 'GTE50')
df.head()

# %% Bins
df['B_binned'] = pd.cut(df['B'],
                        bins=[-np.inf, 1, 11, 21, 31, 41, np.inf],
                        right=False,
                        labels=['0', '1-10', '11-20', '21-30', '31-40', '41+'])
df.head()

# %% Dummies
s = df['D']
dummies = pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
df = pd.concat([df, dummies], axis=1)

df.head()

# %% Summary
summary = df.groupby(['B_binned', 'Cx']).agg({'One': 'sum', 'Two': 'sum'})

summary.reset_index(inplace=True)
summary = pd.melt(summary,
                  id_vars=['B_binned', 'Cx'],
                  value_vars=['One', 'Two'],
                  var_name='D',
                  value_name='count')
summary.sort_values(['B_binned', 'D', 'Cx'], inplace=True)
summary.set_index(['B_binned', 'D', 'Cx'], inplace=True)
summary

# %% Chart
summary.unstack().unstack().plot(kind='bar')

Upvotes: 1

Pandas: Help transforming data and writing better code

Data

Question

Answers (4)

Numpy

Related Questions