Bitswazsky

Reputation: 4698

How to create a new pandas column with increasing sequence id, but retain same value within each group

I have a pandas dataframe that looks like the one below:

import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB']})

    hourOfDay   category
0   5           pageA
1   5           pageB
2   8           pageA
3   8           pageB
4   13          pageA
5   13          pageB

What I want is a new column holding a monotonically increasing id that has the same value for every row within a group (hourOfDay). The expected dataframe is shown below.

    hourOfDay   category    index
0           5   pageA       1
1           5   pageB       1
2           8   pageA       2
3           8   pageB       2
4          13   pageA       3
5          13   pageB       3

For simplicity, we can assume for now that the category column has only two values, but this may be extended later. If I group by hourOfDay, every page category within that group should get the same value assigned to it. I can do this by splitting the main dataframe into two separate dataframes (filtered by category), sorting each, creating a new column with df.groupby("hourOfDay").cumcount(), and finally merging the two dataframes. But this approach seems way too convoluted, so I was wondering if there's a simpler way of achieving the same thing.

Upvotes: 2

Views: 713

Answers (3)

jezrael

Reputation: 862691

If you need the same index per hourOfDay, use GroupBy.ngroup:

df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

Or use factorize, sorting first so that the ids follow hourOfDay order:

df = df.sort_values('hourOfDay')
df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
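To illustrate why the sort matters, here is a minimal sketch on a shuffled version of the question's data (the shuffled frame and the variable names are mine, not from the answer). As far as I know, ngroup with sort=True numbers groups by sorted key, while factorize numbers values by order of first appearance unless sort=True is passed.

```python
import pandas as pd

# Hypothetical shuffled variant of the question's data.
df = pd.DataFrame({
    'hourOfDay': [8, 5, 13, 5, 8, 13],
    'category': ['pageA', 'pageA', 'pageA', 'pageB', 'pageB', 'pageB'],
})

# ngroup with sort=True: ids follow the sorted hourOfDay values.
by_key = df.groupby('hourOfDay', sort=True).ngroup() + 1

# factorize without sorting: ids follow order of first appearance.
codes, uniques = pd.factorize(df['hourOfDay'])
by_appearance = pd.Series(codes + 1, index=df.index)

# factorize with sort=True: ids follow sorted values, no sort_values needed.
sorted_codes, _ = pd.factorize(df['hourOfDay'], sort=True)
by_sorted_factorize = pd.Series(sorted_codes + 1, index=df.index)

print(by_key.tolist())               # [2, 1, 3, 1, 2, 3]
print(by_appearance.tolist())        # [1, 2, 3, 2, 1, 3]
print(by_sorted_factorize.tolist())  # [2, 1, 3, 1, 2, 3]
```

Either ngroup or factorize with sort=True keeps the original row order, which may be preferable if the dataframe should not be re-sorted.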

Upvotes: 2

Corralien

Reputation: 120419

Use diff and cumsum:

df['index'] = df['hourOfDay'].diff().ne(0).cumsum()
print(df)

# Output:
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3
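One caveat worth checking: diff and cumsum label consecutive runs, so this relies on rows with the same hourOfDay being adjacent. The sketch below uses a made-up series (not from the question) where hour 5 reappears after hour 8, and shows one possible workaround of sorting first and then restoring the original order.

```python
import pandas as pd

# Hypothetical unsorted input: hour 5 shows up again after hour 8.
hours = pd.Series([5, 5, 8, 5, 5])

# On unsorted data each run gets its own id, so hour 5 ends up with ids 1 and 3.
run_id = hours.diff().ne(0).cumsum()

# Sorting first (stable sort) makes equal hours adjacent;
# sort_index() then puts the ids back in the original row order.
sorted_hours = hours.sort_values(kind='stable')
group_id = sorted_hours.diff().ne(0).cumsum().sort_index()

print(run_id.tolist())    # [1, 1, 2, 3, 3]
print(group_id.tolist())  # [1, 1, 2, 1, 1]
```

If the dataframe is already sorted by hourOfDay, as in the question, the one-liner above is enough.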

Upvotes: 1

U13-Forward

Reputation: 71580

Try:

>>> df['index'] = df['hourOfDay'].eq(df['hourOfDay'].shift(-1)).cumsum()
>>> df
  hourOfDay category  index
0         5    pageA      1
1         5    pageB      1
2         8    pageA      2
3         8    pageB      2
4        13    pageA      3
5        13    pageB      3

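Use eq and shift(-1) to determine whether the current value is the same as the next value, then use cumsum to cumulatively sum the resulting Trues (1) and Falses (0). Note that this yields one id per group only when every group has exactly two rows and equal values are adjacent, as in the example; for groups of other sizes, compare with the previous row instead, using ne and shift().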
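The following sketch runs the expression above on groups of three rows, next to a variant that compares each row with the previous one (shift() and ne). The three-row data and the variant are my own additions for illustration, not part of the answer, and both assume equal hours are adjacent.

```python
import pandas as pd

# Hypothetical data with three rows per hour instead of two.
hours = pd.Series([5, 5, 5, 8, 8, 8])

# Compare with the NEXT row: counts every row that equals its successor,
# so ids drift once a group has more than two rows.
next_based = hours.eq(hours.shift(-1)).cumsum()

# Compare with the PREVIOUS row: the id increases only where a new run starts.
prev_based = hours.ne(hours.shift()).cumsum()

print(next_based.tolist())  # [1, 2, 2, 3, 4, 4]
print(prev_based.tolist())  # [1, 1, 1, 2, 2, 2]
```

With exactly two rows per hour, both expressions give 1, 1, 2, 2, 3, 3, which is why the output shown above matches the question.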

Upvotes: 1
