Reputation: 2253
I am working with large dataframe column that have the following structure:
In:
df1 = pd.DataFrame({'A': [0,0,0,1,2,0,1,0,1,2,3,4,5,6,7]})
Out:
A
0 0
1 0
2 0
3 1
4 2
5 0
6 1
7 0
8 1
9 2
10 3
11 4
12 5
13 6
14 7
As you can see, A
is an ordered sequence from 0 to n. That represents an order in my data. For example:
A
2 0
3 1
4 2
5 0
Let's take a chunk of df1
(from index 2
to 5
), in column A
the 0
represents the beginning and the 2
represents the end of the sequence. On the other hand, if there is no consecutive number (e.g. 0
) it represent that sequence ends. Thus, my question is how to generate efficiently a new column (e.g. id
) conformed by a unique key or number based in the numerical order of A
:
A id
0 0 -> begin and ends 1
1 0 -> begin and ends 2
2 0 -> begin 3
3 1 -> continue 3
4 2 -> ends 3
5 0 -> begin 4
6 1 -> ends 4
7 0 -> begin 5
8 1 -> continue 5
9 2 -> continue 5
10 3 -> continue 5
11 4 -> continue 5
12 5 -> continue 5
13 6 -> continue 5
14 7 -> ends 5
I added a diagram in order to be more clear.
Upvotes: 2
Views: 86
Reputation: 863166
I think you can use:
print ((df1.A.diff() < 1).cumsum() + 1)
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
8 5
9 5
10 5
11 5
12 5
13 5
14 5
Name: A, dtype: int32
More general with dealing with NaN
after diff
:
dif = df1.A.diff()
dif.iloc[0] = df1.loc[0,'A']
print ((dif < 1).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
8 5
9 5
10 5
11 5
12 5
13 5
14 5
Name: A, dtype: int32
Explanation:
First find differences by diff
:
dif = df1.A.diff()
print (dif)
0 NaN
1 0.0
2 0.0
3 1.0
4 1.0
5 -2.0
6 1.0
7 -1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 1.0
13 1.0
14 1.0
Name: A, dtype: float64
Then set first value (NaN
) to original value:
dif.iloc[0] = df1.loc[0,'A']
Get mask:
print (dif < 1)
0 True
1 True
2 True
3 False
4 False
5 True
6 False
7 True
8 False
9 False
10 False
11 False
12 False
13 False
14 False
Name: A, dtype: bool
Last use cumsum
with boolean mask:
print ((dif < 1).cumsum())
0 1
1 2
2 3
3 3
4 3
5 4
6 4
7 5
8 5
9 5
10 5
11 5
12 5
13 5
14 5
Name: A, dtype: int32
Upvotes: 5