john doe
john doe

Reputation: 2253

How to create a unique id column given an ordered numerical series?

I am working with large dataframe column that have the following structure:

In:

df1 = pd.DataFrame({'A': [0,0,0,1,2,0,1,0,1,2,3,4,5,6,7]})

Out:

    A
0   0
1   0
2   0
3   1
4   2
5   0
6   1
7   0
8   1
9   2
10  3
11  4
12  5
13  6
14  7

As you can see, A is an ordered sequence from 0 to n. That represents an order in my data. For example:

    A
2   0
3   1
4   2
5   0

Let's take a chunk of df1 (from index 2 to 5), in column A the 0 represents the beginning and the 2 represents the end of the sequence. On the other hand, if there is no consecutive number (e.g. 0) it represent that sequence ends. Thus, my question is how to generate efficiently a new column (e.g. id) conformed by a unique key or number based in the numerical order of A:

      A                             id
0     0 ->  begin and ends          1                

1     0 ->  begin and ends          2

2     0 ->  begin                   3
3     1 ->  continue                3
4     2 ->  ends                    3

5     0 ->  begin                   4
6     1 ->  ends                    4

7     0 ->  begin                   5
8     1 ->  continue                5
9     2 ->  continue                5
10    3 ->  continue                5
11    4 ->  continue                5
12    5 ->  continue                5
13    6 ->  continue                5
14    7 ->  ends                    5

I added a diagram in order to be more clear.

Upvotes: 2

Views: 86

Answers (1)

jezrael
jezrael

Reputation: 863166

I think you can use:

print ((df1.A.diff() < 1).cumsum() + 1)
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32

More general with dealing with NaN after diff:

dif = df1.A.diff()
dif.iloc[0] = df1.loc[0,'A']
print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32

Explanation:

First find differences by diff:

dif = df1.A.diff()
print (dif)
0     NaN
1     0.0
2     0.0
3     1.0
4     1.0
5    -2.0
6     1.0
7    -1.0
8     1.0
9     1.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
Name: A, dtype: float64

Then set first value (NaN) to original value:

dif.iloc[0] = df1.loc[0,'A']

Get mask:

print (dif < 1)
0      True
1      True
2      True
3     False
4     False
5      True
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: A, dtype: bool

Last use cumsum with boolean mask:

print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32

Upvotes: 5

Related Questions