Einar
Einar

Reputation: 4933

Groupby consecutive occurrences of two column values in pandas

I have a pandas dataframe with this structure:

ID    loc    start    end    a_cn   b_cn
A     1      123      123    1      1
A     1      125      125    1      1
A     1      235      235    1      1
A     1      456      456    2      0
A     1      556      556    2      0
A     1      586      586    2      1
A     1      596      596    2      1
A     1      676      676    1      1

What I need to do is to be able to group consecutive rows where a_cn and b_cn are identical, but not if there is a "gap". In other words, the groups on the above data would be

Group 1

A     1      123      123    1      1
A     1      125      125    1      1
A     1      235      235    1      1

Group 2

A     1      456      456    2      0
A     1      556      556    2      0

Group 3

A     1      586      586    2      1
A     1      596      596    2      1

Group 4

A     1      676      676    1      1

groupby alone won't work, because it would group together the first 3 lines and the last one, which is not what I want (I need to do some calculations with start and end afterwards, so they must stay in order).

This question is similar to How to groupby with consecutive occurrence of duplicates in pandas but the solution there only works for a single column and I couldn't figure out how to make it work for 2 columns at the same time.

I've also tested

diff = (
    (df["a_cn"] != df["a_cn"].shift()) &
    (df["b_cn"] != df["b_cn"].shift())
).cumsum()

but it doesn't really do what I want because it trips on group 3, which has an identical a_cn compared to the previous group. As a non working example:

>>> diff

0    1
1    1
2    1
3    2
4    2
5    2
6    2
7    2

which is obviously incorrect.

Upvotes: 0

Views: 846

Answers (1)

jezrael
jezrael

Reputation: 862471

You can compare both columns with DataFrame.ne for != by shifted rows of both columns and then add DataFrame.any for test if True at least in one column, last added cumulative sum:

diff = df[["a_cn","b_cn"]].ne(df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
#alternative
diff = (df[["a_cn","b_cn"]] != df[["a_cn","b_cn"]].shift()).any(axis=1).cumsum()
print (diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32

Your solution should be changed with | for bitwise OR:

diff = (
    (df["a_cn"] != df["a_cn"].shift()) |
    (df["b_cn"] != df["b_cn"].shift())
).cumsum()
print (diff)
0    1
1    1
2    1
3    2
4    2
5    3
6    3
7    4
dtype: int32

Upvotes: 3

Related Questions