Reputation: 4933
I have a pandas dataframe with this structure:
ID  loc  start  end  a_cn  b_cn
A   1    123    123  1     1
A   1    125    125  1     1
A   1    235    235  1     1
A   1    456    456  2     0
A   1    556    556  2     0
A   1    586    586  2     1
A   1    596    596  2     1
A   1    676    676  1     1
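For reference, the sample frame can be rebuilt like this (a minimal reconstruction of the table above; the integer dtypes are an assumption):

import pandas as pd

# Reconstruction of the sample frame shown above (integer dtypes assumed)
df = pd.DataFrame({
    "ID":    ["A"] * 8,
    "loc":   [1] * 8,
    "start": [123, 125, 235, 456, 556, 586, 596, 676],
    "end":   [123, 125, 235, 456, 556, 586, 596, 676],
    "a_cn":  [1, 1, 1, 2, 2, 2, 2, 1],
    "b_cn":  [1, 1, 1, 0, 0, 1, 1, 1],
})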
What I need is to group consecutive rows where a_cn and b_cn are identical, but not across a "gap". In other words, the groups in the above data would be:
Group 1
A   1    123    123  1     1
A   1    125    125  1     1
A   1    235    235  1     1
Group 2
A   1    456    456  2     0
A   1    556    556  2     0
Group 3
A   1    586    586  2     1
A   1    596    596  2     1
Group 4
A   1    676    676  1     1
groupby alone won't work, because it would group the first 3 rows together with the last one, which is not what I want (I need to do some calculations with start and end afterwards, so the rows must stay in order).
This question is similar to How to groupby with consecutive occurrence of duplicates in pandas, but the solution there only works for a single column, and I couldn't figure out how to make it work for two columns at the same time.
I've also tested
diff = (
(df["a_cn"] != df["a_cn"].shift()) &
(df["b_cn"] != df["b_cn"].shift())
).cumsum()
but it doesn't really do what I want, because it trips over group 3, which has the same a_cn as the previous group. As a non-working example:
>>> diff
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
which is obviously incorrect.
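Comparing each column to its shift separately (a quick check on the data above) shows where it breaks: at the boundary between groups 2 and 3 only b_cn changes, so the & of the two masks stays False there and no new group is started.

a_changed = df["a_cn"] != df["a_cn"].shift()   # False at index 5 (a_cn stays 2)
b_changed = df["b_cn"] != df["b_cn"].shift()   # True at index 5 (b_cn goes 0 -> 1)
print((a_changed & b_changed)[5])              # False, so the cumsum never increments here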
Upvotes: 0
Views: 846
Reputation: 862471
You can compare both columns with DataFrame.ne (the != operator) against their shifted rows, then use DataFrame.any to test whether at least one column changed, and finally take the cumulative sum:
diff = df[["a_cn","b_cn"]].ne(df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
#alternative
diff = (df[["a_cn","b_cn"]] != df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
print (diff)
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 4
dtype: int32
Your solution should be changed to use | for bitwise OR:
diff = (
(df["a_cn"] != df["a_cn"].shift()) |
(df["b_cn"] != df["b_cn"].shift())
).cumsum()
print (diff)
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 4
dtype: int32
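To then do the per-block calculations, the key can be passed straight to groupby. A sketch, assuming the goal is to collapse each run to its first start and last end (the aggregation choices are only an example):

out = (df.groupby(diff)
         .agg(ID=("ID", "first"),
              loc=("loc", "first"),
              start=("start", "min"),
              end=("end", "max"),
              a_cn=("a_cn", "first"),
              b_cn=("b_cn", "first"))
         .reset_index(drop=True))
print(out)
#   ID  loc  start  end  a_cn  b_cn
# 0  A    1    123  235     1     1
# 1  A    1    456  556     2     0
# 2  A    1    586  596     2     1
# 3  A    1    676  676     1     1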
Upvotes: 3