S.zhen
S.zhen

Reputation: 359

Create a column which increments based on another column in Python

I've currently switched my focus from R to Python. I work with data.table in R a lot, and I find it sometimes quite difficult to find an equivalent for some functions in Python.

I have a pandas data frame that looks like this:

df = pd.DataFrame({'A':['abc','def', 'def', 'abc', 'def', 'def','abc'],'B':[13123,45,1231,463,142131,4839, 4341]})

     A       B  
0  abc   13123    
1  def      45  
2  def    1231  
3  abc     463  
4  def  142131  
5  def    4839
6  abc    4341

I need to create a column that increments from 1 based on A and B, so that it indicates the increasing order of B. So I first create the sorted data frame, and the column I'm interested in creating is C as below:

    A       B   C
1  abc     463  1
6  abc    4341  2
0  abc   13123  3
3  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4

In R, using the library(data.table), this can be easily done in one line and creates a column within the original data table:

df[, C := 1:.N, by=A]

I've looked around and I think I might be able to make use of something like this:

df.groupby('A').size()
or
df['B'].argsort()

but not sure how to proceed from here, and how to join the new column back to the original data frame. It would be very helpful if anyone could give me any pointer.

Many thanks!

Upvotes: 6

Views: 3419

Answers (3)

meteore
meteore

Reputation: 4775

Index magic seems to be another way:

df['C']=df.sort(['A','B'],inplace=True).groupby('A').reset_index().index.labels[1]

Upvotes: 0

Wouter Overmeire
Wouter Overmeire

Reputation: 69136

In [61]: df
Out[61]:
     A       B
1  abc     463
6  abc    4341
0  abc   13123
3  def      45
2  def    1231
5  def    4839
4  def  142131

In [62]: df['C'] =  df.groupby('A')['A'].transform(lambda x: pd.Series(range(1, len(x)+1), index=x.index))

In [63]: df
Out[63]:
     A       B  C
1  abc     463  1
6  abc    4341  2
0  abc   13123  3
3  def      45  1
2  def    1231  2
5  def    4839  3
4  def  142131  4

Upvotes: 6

Matt Dowle
Matt Dowle

Reputation: 59612

And for comparison, the correct data.table syntax is :

df[, C := 1:.N, by=A]

This adds a new column C by reference to df. The := operator is part of the data.table package for R. It allows you to add and remove columns and assign to subsets of data.table, by group, by reference with no copy at all.

Upvotes: 1

Related Questions