Lynn
Lynn

Reputation: 4398

Separate aggregated row into distinct rows, adding a unique count in pandas

I have a dataset that has several rows and columns, however within the column labeled, 'active', I wish to remove the aggregation and separate this into its own unique count. I also wish to add a column that gives this count a unique id.

Data

Pair    gen     box    date     active

sox     black    bl     2021     3
sox     red      re     2021     2

Desired

Pair    gen     box    date     active    count

sox     black      bl     2021     1      b101
sox     black      bl     2021     1      bl02
sox     black      bl     2021     1      bl02
sox     red        re     2021     1      re01
sox     red        re     2021     1      re02

The 'active' column now has 3 and 2 distinct rows respectively, along with a distinct count ID (based on the 'box' column) instead of an aggregation.

Doing

A SO assisted with a similar dilemma, and I am taking this approach:

# Melt Table Into New Form
df = df.melt(col_level=0, value_name='count', var_name='bl')

# Repeat Based on Count
df = df.reindex(df.index.repeat(df['count']))
# Set Count To 1
df['count'] = 1


# Add Suffix to Each ID
df['ID'] = df['ID'] + (
    (df.groupby('ID').cumcount() + 1)
        .astype(str)
        .str.zfill(2)
)

This works, however, I am not able to retain the other columns as well as set individual var names such as:

bl01
bl02

re01
re02

Any suggestion or advice is appreciated

Upvotes: 2

Views: 71

Answers (2)

Umar.H
Umar.H

Reputation: 23099

another method would be a concat along your axis, then use cumcount to create your count column.

dfs = pd.concat([pd.concat([y.assign(active=1,)] * act)
           for (idx,act),y in df.groupby([df.index, df['active']])])

dfs['count'] = dfs['box'] + (dfs.groupby('box').cumcount() + 1).astype(str).str.zfill(2)

print(dfs)

  Pair    gen box  date  active count
0  sox  black  bl  2021       1  bl01
0  sox  black  bl  2021       1  bl02
0  sox  black  bl  2021       1  bl03
1  sox    red  re  2021       1  re01
1  sox    red  re  2021       1  re02

Upvotes: 1

David Erickson
David Erickson

Reputation: 16683

You don't need melt for this problem and can reindex and repeat with the following. Keep in mind that for the count column, I realize that if each group has a size > 10 that it would show up as bl010 as an example, when you really want bl10. The number formatting works with the provided sample data though with group size less than 10, and with group size > 10, it is not clear what the desired output is:

df = df.reindex(df.index.repeat(df['active'])).assign(active=1)
df['count'] = df['box'] + '0' + (df.groupby(['Pair', 'gen', 'box']).cumcount() + 1).astype(str)
df
Out[1]: 
  Pair    gen box  date  active count
0  sox  black  bl  2021       1  bl01
0  sox  black  bl  2021       1  bl02
0  sox  black  bl  2021       1  bl03
1  sox    red  re  2021       1  re01
1  sox    red  re  2021       1  re02

Upvotes: 4

Related Questions