Reputation: 29
I would like to iterate over each row of the Pandas DataFrame below and transform it into a more compact form.
1) I need to detect the changes in the Zone column and then collapse all the rows with the same Zone into a single row.
2) I also need to calculate the time difference between the last and the first row of the same Zone. I guess I can use the diff() function there, but I am not sure how to combine these things.
Important note: The dataset is big, so the Zone values are not just one-time occurrences. Beyond the example below, there will be rows with Zone = 'Between Space' again.
Example dataframe:
     Time  Centre position X  Centre position Y  Datafile  Group           Zone           Timeframe
 0  0.000                NaN                NaN      CHR1    CHR  Between Space  Before stimulation
 1  0.920              249.0              219.0      CHR1    CHR  Between Space  Before stimulation
 2  0.984              249.0              217.0      CHR1    CHR  Between Space  Before stimulation
 3  1.112              250.0              216.0      CHR1    CHR  Between Space  Before stimulation
 4  1.256              252.0              215.0      CHR1    CHR  Between Space  Before stimulation
 5  1.384              253.0              217.0      CHR1    CHR  Between Space  Before stimulation
 6  1.512              256.0              222.0      CHR1    CHR         Centre  Before stimulation
 7  1.576              260.0              222.0      CHR1    CHR         Centre  Before stimulation
 8  1.720              271.0              221.0      CHR1    CHR         Centre  Before stimulation
 9  1.848              277.0              219.0      CHR1    CHR         Centre  Before stimulation
10  1.976              288.0              216.0      CHR1    CHR         Centre  Before stimulation
Expected output:
    Time  Datafile  Group           Zone           Timeframe
0  1,384      CHR1    CHR  Between Space  Before stimulation
1  0,592      CHR1    CHR         Centre  Before stimulation
Thank you very much in advance for your help!
Upvotes: 1
Views: 1325
Reputation: 8898
So it looks like groupby() doesn't work directly for you, since it would merge all the separate blocks that share a Zone value into one group, and you want to keep them separate.
Instead, it seems what you want is to find groups of contiguous rows with the same "Zone" value, correct?
And you're only keeping the last entry of each block, correct?
If so, you can use this expression to only keep the last row of each group which shares the same "Zone":
df = df[df.Zone != df.Zone.shift(-1)]
The expression df.Zone.shift(-1) shifts the elements up by one row. By comparing with the unshifted elements, you're able to detect where changes happen (the rows where the next element differs from the current one).
Using that expression to select rows from the DataFrame will return the last row of each block.
The last line of the DataFrame will also be returned, as it should, since it's the end of its block. It is returned because shifting the elements up leaves a NaN at the end, and NaN never compares equal to a Zone value.
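To make the shift comparison concrete, here is a minimal sketch on a made-up toy frame (the data and the names is_block_end and last_rows are assumptions for illustration, not part of the question). Zone "A" deliberately appears in two separate blocks to show that each block is handled on its own:

```python
import pandas as pd

# Toy data (assumed for illustration): Zone "A" occurs in two separate blocks.
df = pd.DataFrame({
    "Time": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
    "Zone": ["A", "A", "B", "B", "B", "A"],
})

# True wherever the next row's Zone differs, i.e. on the last row of each block.
# The final row is compared against NaN, which is never equal, so it is kept too.
is_block_end = df["Zone"] != df["Zone"].shift(-1)

last_rows = df[is_block_end]
print(last_rows)
```

The rows kept are those with index 1, 4 and 5: the end of the first "A" block, the end of the "B" block, and the final single-row "A" block.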
For the second part, it seems you want to update "Time" so that it holds the difference from the end of the previous block, correct? Using diff() is indeed an option, but one disadvantage of that approach is that you end up with a NaN in the first row. You can use shift() again to overcome that issue, by asking it to use 0.0 instead of NaN for the shifted cell.
df.loc[:, 'Time'] -= df.Time.shift(fill_value=0.0)
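As a small illustration of the difference between the two options, the sketch below compares diff() with the shift(fill_value=0.0) subtraction on a made-up Series (the values and the names via_diff and via_shift are assumptions for illustration):

```python
import pandas as pd

# Hypothetical block-end timestamps, chosen so the differences are exact.
times = pd.Series([1.5, 2.0, 4.5])

# diff() has nothing to subtract from the first element, so it yields NaN there.
via_diff = times.diff()

# Shifting down with fill_value=0.0 subtracts 0.0 from the first element instead.
via_shift = times - times.shift(fill_value=0.0)

print(via_diff.tolist())
print(via_shift.tolist())
```

With the shift variant the first entry keeps its own value (here 1.5), which amounts to measuring the first block from time 0.0.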
Finally, it seems you're dropping the two "Centre position" columns, which you can do with:
df = df.drop(['Centre position X', 'Centre position Y'], axis=1)
These changes seem to match your expected output.
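Putting the three steps together on the example data from the question gives roughly the following. This is a sketch rather than a definitive implementation: the .copy() call (to avoid modifying a filtered view), the out variable, and the final reset_index(drop=True) (to get the 0, 1 index shown in the expected output) are additions of mine, not part of the steps above.

```python
import numpy as np
import pandas as pd

# The example DataFrame from the question.
df = pd.DataFrame({
    "Time": [0.000, 0.920, 0.984, 1.112, 1.256, 1.384,
             1.512, 1.576, 1.720, 1.848, 1.976],
    "Centre position X": [np.nan, 249.0, 249.0, 250.0, 252.0, 253.0,
                          256.0, 260.0, 271.0, 277.0, 288.0],
    "Centre position Y": [np.nan, 219.0, 217.0, 216.0, 215.0, 217.0,
                          222.0, 222.0, 221.0, 219.0, 216.0],
    "Datafile": "CHR1",
    "Group": "CHR",
    "Zone": ["Between Space"] * 6 + ["Centre"] * 5,
    "Timeframe": "Before stimulation",
})

# Step 1: keep only the last row of each contiguous Zone block.
out = df[df["Zone"] != df["Zone"].shift(-1)].copy()

# Step 2: replace each timestamp by its difference from the previous block end,
# using 0.0 as the reference for the first block.
out["Time"] = out["Time"] - out["Time"].shift(fill_value=0.0)

# Step 3: drop the position columns and renumber the rows.
out = out.drop(columns=["Centre position X", "Centre position Y"])
out = out.reset_index(drop=True)

print(out)
```

The resulting "Time" values are approximately 1.384 for "Between Space" and 0.592 for "Centre", up to floating-point rounding.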
Upvotes: 1
Reputation: 660
Hope this does what you want:
df_new = df.groupby(df['Zone']).agg({"Time": "sum", "Datafile": "min", "Group": "min", "Timeframe": "min"})
.groupby(df['Zone']) groups the rows of the DataFrame by "Zone".
.agg({"Time": "sum", ...}) adds up the Time values within each group.
.agg({..., "Datafile": "min", "Group": "min", "Timeframe": "min"}) keeps the other columns by taking the minimum within each group, which leaves them as they are in the example because they are constant within each group.
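One caveat: grouping on Zone alone puts every row with the same Zone value into a single group, even when the blocks are not adjacent, and the question says Zone values will recur. A possible workaround, sketched here on made-up data, is to label each contiguous run first and group on that label. This is a different technique from the one above, and the names block_id, summary, start, end and duration are my own. Note that it measures the duration inside each block (last minus first timestamp), which is not the same as the difference from the previous block's end.

```python
import pandas as pd

# Toy data (assumed for illustration): Zone "A" occurs in two separate blocks.
df = pd.DataFrame({
    "Time": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
    "Zone": ["A", "A", "B", "B", "B", "A"],
})

# A new block starts wherever Zone differs from the previous row;
# the cumulative sum of those start flags gives each block its own number.
block_id = (df["Zone"] != df["Zone"].shift()).cumsum().rename("block")

# Aggregate per block rather than per Zone value.
summary = df.groupby(block_id).agg(
    Zone=("Zone", "first"),
    start=("Time", "first"),
    end=("Time", "last"),
)
summary["duration"] = summary["end"] - summary["start"]

print(summary)
```

Here the two "A" blocks stay separate, giving three rows in total.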
Upvotes: 0