barwuah
barwuah

Reputation: 51

Python df groupby with agg for string and sum

With this df DF as base i want the following output: resultDF

So all should be aggregated by column 0 and all strings from column 1 should be added and the numbers from column 2 should be summed when the strings from column 1 have the same name.

With the following code i could aggregate the strings but without summing the numbers:

df2= df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2

uncompleteSolutiondf

Upvotes: 1

Views: 1428

Answers (2)

johnpaton
johnpaton

Reputation: 765

This seems like a 2-step process. It also requires that each group from column 1 has the same number of unique elements in column 2. First groupby the columns you want grouped

df_grouped = df.groupby([0,1]).sum().reset_index()

Then reshape to the form you want:

def group_to_row(group):
    group = group.sort_values(1)
    output = []
    for i, row in group[[1,2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])

df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()

This is untested but this is also quite a non-standard form so unfortunately I don't think there's a standard Pandas function for you.

Upvotes: 1

jpp
jpp

Reputation: 164693

Avoid an arbitrary number of columns

Your desired output suggests you have an arbitrary number of columns dependent on the number of values in 1 for each group 0. This is anti-Pandas, which is strongly geared towards an arbitrary number of rows. Hence series-wise operations are preferred.

So you can just use groupby + sum to store all the information you require.

df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
                   1: ['Mat1', 'Mat2', 'Mat2'],
                   2: [3, 1, 1]})

df_sum = df.groupby([0, 1]).sum().reset_index()

print(df_sum)
               0     1  2
0  2008-04_E.pdf  Mat1  3
1  2008-04_E.pdf  Mat2  2

But if you insist...

If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.

key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format)
res = df_sum.set_index([0, key]).unstack().reset_index().drop('key', axis=1)

res.columns = res.columns.droplevel(0)

print(res)
                  Key1  Key2  Key1  Key2
0  2008-04_E.pdf  Mat1  Mat2     3     2

Upvotes: 1

Related Questions