Reputation: 137
I have a dataset as follow:
task = ['duty of care','informed consent','records management','conducting','experiments','positive reinforcement','developing','rapport' ]
start = [ 21, 24, 26, 60, 61, 80,98,99 ]
end = [ 24, 26, 28, 61, 62, 82,99,100]
dat = pd.DataFrame({'task': task, 'start':start, 'end': end})
dat
What I want to accomplish is this
dat.end[i] == dat.start[i+1]
then aggregate dat.task
dat.end[i] != dat.start[i+1]
do nothingThe desired output should look like this.
Upvotes: 2
Views: 242
Reputation: 22503
IIUC you can first create a new column using cumsum
and groupby
that column:
df = pd.DataFrame({'task': task, 'start':start, 'end': end})
df["count"] = (df["start"]-df["end"].shift(1)).fillna(0).ne(0).cumsum()
print (df.groupby("count").agg({"task":"first","start":"min","end":"max"}))
task start end
count
0 duty of care 21 28
1 conducting 60 62
2 positive reinforcement 80 82
3 developing 98 100
Upvotes: 1
Reputation: 86
We'll want a way to group these together if they're part of a block. First, let's find all the times where the start does not equal the next item's end. This series is True if an item is the first in a block and false otherwise.
temp = (dat['start'] != dat['end'].shift(1))
It looks like this:
0 True
1 False
2 False
3 True
4 False
5 True
6 True
7 False
We can then take a cumulative sum. When you do that to True/False, True will add 1 to the sum and False's won't change it. This is useful because it then gives us a grouping -- every block that you wanted to combine originally has it's own number associated with it.
groups = temp.cumsum()
looks like:
0 1
1 1
2 1
3 2
4 2
5 3
6 4
7 4
Now we're almost there. For every group, you want to grab the min start, the max end, and concatenate all the text.
datg = dat.groupby(groups)
out = pd.DataFrame({
'task': datg['task'].apply(' '.join),
'start': datg['start'].min(),
'end': datg['end'].max(),
})
That gives a final result of:
task start end
1 duty of care informed consent records management 21 28
2 conducting experiments 60 62
3 positive reinforcement 80 82
4 developing rapport 98 100
Putting it all in one place:
temp = (dat['start'] != dat['end'].shift(1))
groups = temp.cumsum()
datg = dat.groupby(groups)
out = pd.DataFrame({
'task': datg['task'].apply(' '.join),
'start': datg['start'].min(),
'end': datg['end'].max(),
})
Upvotes: 4