ZachGutz
ZachGutz

Reputation: 61

How to create a new 'index' column after using pandas DataFrame.explode()?

I'm using DataFrame.explode() to unnest a column of lists such that each element gets its own row. What I'm wondering is how to create a new 'index' column that will correspond to the index of the element in the original list. In the example I'm calling this column 'cycle'.

I've hopefully attached an image below of what I'm trying to achieve. If there is already a page on this please share. Thank you!

enter image description here

Upvotes: 5

Views: 3100

Answers (2)

rsrjohnson
rsrjohnson

Reputation: 120

First, create your cycle column:

import pandas as pd

df = pd.DataFrame([
                    ["A",[2,3,5,2,3,6,8]],
                    ["B",[97,83,5]],
                    ["C",[2.5,6,7,9]]
                  ],
                   columns=["sample", "value"])


df["cycle"]=df["value"].str.len().apply(lambda x: list(range(0,x)))

Then you can simply explode both of the columns simultaneously. This option is available since the latest pandas version. (check the documentation here)

df.explode(["value","cycle"])

    sample  value   cycle
0        A      2       0
0        A      3       1
0        A      5       2
0        A      2       3
0        A      3       4
0        A      6       5
0        A      8       6
1        B     97       0
1        B     83       1
1        B      5       2
2        C    2.5       0
2        C      6       1
2        C      7       2
2        C      9       3

Upvotes: 1

Henry Ecker
Henry Ecker

Reputation: 35626

Use groupby cumcount after explode:

df = df.explode('value').reset_index(drop=True)
df['cycle'] = df.groupby('sample').cumcount()  # Enumerate Groups
df = df[['sample', 'cycle', 'value']]  # Reorder columns

or with insert:

df = df.explode('value').reset_index(drop=True)
df.insert(1, 'cycle', df.groupby('sample').cumcount())  # Insert New Column

or with assign:

df = (
    df.explode('value')
        .assign(cycle=lambda x: x.groupby(level=0).cumcount())
        .reset_index(drop=True)[['sample', 'cycle', 'value']]
)

df:

   sample  cycle value
0       A      0     2
1       A      1     3
2       A      2     5
3       A      3     2
4       A      4     3
5       A      5     6
6       A      6     8
7       B      0    97
8       B      1    83
9       B      2     5
10      C      0   2.5
11      C      1     6
12      C      2     7
13      C      3     9

Upvotes: 4

Related Questions