Is there a way to detect the change of a categorical variable in Pandas?

Question

I would like to iterate over each row in the Pandas dataset below and transform it into a more compact form.

1) I need to detect the changes in the Zone column and then collapse all consecutive rows with the same Zone into a single row.

2) I also need to calculate the time difference between the last and the first row of the same Zone. I guess I can use the diff() function, but I am not sure how to combine these things.

Important note: The dataset is big, so the Zone values are not just one-time occurrences. In the full dataset, rows with Zone = 'Between Space' appear again after the ones shown in the example below.

Example dataframe:

    Time   Centre position X  Centre position Y  Datafile  Group  Zone           Timeframe
0   0.000  NaN                NaN                CHR1      CHR    Between Space  Before stimulation
1   0.920  249.0              219.0              CHR1      CHR    Between Space  Before stimulation
2   0.984  249.0              217.0              CHR1      CHR    Between Space  Before stimulation
3   1.112  250.0              216.0              CHR1      CHR    Between Space  Before stimulation
4   1.256  252.0              215.0              CHR1      CHR    Between Space  Before stimulation
5   1.384  253.0              217.0              CHR1      CHR    Between Space  Before stimulation
6   1.512  256.0              222.0              CHR1      CHR    Centre         Before stimulation
7   1.576  260.0              222.0              CHR1      CHR    Centre         Before stimulation
8   1.720  271.0              221.0              CHR1      CHR    Centre         Before stimulation
9   1.848  277.0              219.0              CHR1      CHR    Centre         Before stimulation
10  1.976  288.0              216.0              CHR1      CHR    Centre         Before stimulation

Expected output:

    Time   Datafile  Group  Zone           Timeframe
0   1.384  CHR1      CHR    Between Space  Before stimulation
1   0.592  CHR1      CHR    Centre         Before stimulation

Thank you very much in advance for your help!

filbranden · Accepted Answer

So it looks like groupby() doesn't work directly for you, since it would merge every block that shares a "Zone" value into a single group, whereas you want to keep the separate blocks separate.

Instead, it seems what you want is to find groups of contiguous rows with the same "Zone" value, correct?

And you're only keeping the last entry of each block, correct?

If so, you can use this expression to only keep the last row of each group which shares the same "Zone":

df = df[df.Zone != df.Zone.shift(-1)]

The expression df.Zone.shift(-1) shifts the elements up by one row. By comparing the result with the unshifted elements, you can detect where changes happen (the rows where the next element differs from the current one).

Using that expression to select rows from the DataFrame returns the last row of each block.

The last row of the DataFrame is also returned, as it should be, since it is the end of its block. It is returned because shifting the elements up leaves a NaN in the last position, and NaN never compares equal to any value.
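To make the comparison concrete, here is a minimal sketch on made-up data (the values and the names next_zone, is_block_end and last_rows are only for illustration, not from the question). It builds the boolean mask explicitly so you can see which rows are flagged as the end of a block:

```python
import pandas as pd

# Toy data (made up): only the Zone column matters for this illustration.
df = pd.DataFrame({
    "Time": [0.000, 0.920, 1.384, 1.512, 1.976, 2.100],
    "Zone": ["Between Space", "Between Space", "Between Space",
             "Centre", "Centre", "Between Space"],
})

# shift(-1) moves every value up one row; the last row becomes NaN.
next_zone = df["Zone"].shift(-1)

# True where the next row has a different Zone. The final row is compared
# against NaN, which is never equal to anything, so it is flagged as well.
is_block_end = df["Zone"] != next_zone

# Boolean indexing keeps only the flagged rows (the last row of each block).
last_rows = df[is_block_end]

print(is_block_end.tolist())
print(last_rows)
```

Note that the second "Between Space" block is kept as its own row rather than being merged with the first one, which is the behaviour groupby() on "Zone" alone would not give you.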

For the second part, it seems you want to update "Time" so you keep the diff from the previous block, correct? Indeed using diff() is an option, but one disadvantage of that approach is that you end up with a NaN in the first row. You can use shift() again to overcome that issue, by asking to use 0.0 instead of NaN for the shifted cell.

df.loc[:, 'Time'] -= df.Time.shift(fill_value=0.0)
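As a small side-by-side sketch of the difference between the two options (the third timestamp and the variable names are made up for illustration), diff() leaves a NaN in the first position, while subtracting a shifted copy with fill_value=0.0 treats the start as time zero:

```python
import pandas as pd

# Hypothetical end-of-block timestamps, as they might look after filtering.
times = pd.Series([1.384, 1.976, 2.500], name="Time")

# diff() has nothing to subtract from the first element, so it yields NaN.
with_diff = times.diff()

# shift() moves values down one row; fill_value=0.0 fills the gap at the top,
# so the first result is simply 1.384 - 0.0.
with_shift = times - times.shift(fill_value=0.0)

print(with_diff)
print(with_shift)
```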

Finally, it seems you're dropping the two "Centre position" columns, which you can do with:

df = df.drop(['Centre position X', 'Centre position Y'], axis=1)

These changes seem to match your expected output.
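Putting the three steps together on the example data from the question, a sketch could look like the following. Two things here are my additions rather than part of the steps above: the .copy() after filtering (so that the in-place update of "Time" works on an independent DataFrame instead of a slice) and the reset_index(drop=True) at the end (so the rows are numbered 0, 1 as in the expected output). The name out is arbitrary.

```python
import pandas as pd

# The example DataFrame from the question; None becomes NaN in float columns.
df = pd.DataFrame({
    "Time": [0.000, 0.920, 0.984, 1.112, 1.256, 1.384,
             1.512, 1.576, 1.720, 1.848, 1.976],
    "Centre position X": [None, 249.0, 249.0, 250.0, 252.0, 253.0,
                          256.0, 260.0, 271.0, 277.0, 288.0],
    "Centre position Y": [None, 219.0, 217.0, 216.0, 215.0, 217.0,
                          222.0, 222.0, 221.0, 219.0, 216.0],
    "Datafile": "CHR1",
    "Group": "CHR",
    "Zone": ["Between Space"] * 6 + ["Centre"] * 5,
    "Timeframe": "Before stimulation",
})

# 1) Keep the last row of each contiguous Zone block.
#    .copy() is an addition, so the update below acts on its own DataFrame.
out = df[df["Zone"] != df["Zone"].shift(-1)].copy()

# 2) Replace Time with the difference from the end of the previous block,
#    using 0.0 as the "previous" time for the first block.
out["Time"] -= out["Time"].shift(fill_value=0.0)

# 3) Drop the position columns; renumbering the rows is an addition.
out = out.drop(columns=["Centre position X", "Centre position Y"])
out = out.reset_index(drop=True)

print(out)
```

Because of floating-point arithmetic, the second "Time" value may print as something very close to 0.592 rather than exactly 0.592; rounding with out["Time"].round(3) would tidy that up if needed.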
