Samuel Shamiri
Samuel Shamiri

Reputation: 137

Aggregate (join) text column based on conditions in pandas

I have a dataset as follow:

task = ['duty of care','informed consent','records management','conducting','experiments','positive reinforcement','developing','rapport' ]
start = [ 21, 24, 26,  60, 61, 80,98,99 ]
end = [ 24, 26, 28,  61,  62, 82,99,100]

dat = pd.DataFrame({'task': task, 'start':start, 'end': end})
dat

enter image description here

What I want to accomplish is this

  1. if dat.end[i] == dat.start[i+1] then aggregate dat.task
  2. if dat.end[i] != dat.start[i+1] do nothing

The desired output should look like this.

enter image description here

Upvotes: 2

Views: 242

Answers (2)

Henry Yik
Henry Yik

Reputation: 22503

IIUC you can first create a new column using cumsum and groupby that column:

df = pd.DataFrame({'task': task, 'start':start, 'end': end})

df["count"] = (df["start"]-df["end"].shift(1)).fillna(0).ne(0).cumsum()
print (df.groupby("count").agg({"task":"first","start":"min","end":"max"}))

                         task  start  end
count                                    
0                duty of care     21   28
1                  conducting     60   62
2      positive reinforcement     80   82
3                  developing     98  100

Upvotes: 1

ryl
ryl

Reputation: 86

We'll want a way to group these together if they're part of a block. First, let's find all the times where the start does not equal the next item's end. This series is True if an item is the first in a block and false otherwise.

temp = (dat['start'] != dat['end'].shift(1))

It looks like this:

0     True
1    False
2    False
3     True
4    False
5     True
6     True
7    False

We can then take a cumulative sum. When you do that to True/False, True will add 1 to the sum and False's won't change it. This is useful because it then gives us a grouping -- every block that you wanted to combine originally has it's own number associated with it.

groups = temp.cumsum()

looks like:

0    1
1    1
2    1
3    2
4    2
5    3
6    4
7    4

Now we're almost there. For every group, you want to grab the min start, the max end, and concatenate all the text.

datg = dat.groupby(groups)
out = pd.DataFrame({
    'task':  datg['task'].apply(' '.join),
    'start': datg['start'].min(),
    'end':   datg['end'].max(),
})

That gives a final result of:

                                               task  start  end
1  duty of care informed consent records management     21   28
2                            conducting experiments     60   62
3                            positive reinforcement     80   82
4                                developing rapport     98  100

Putting it all in one place:

temp = (dat['start'] != dat['end'].shift(1))
groups = temp.cumsum()
datg = dat.groupby(groups)
out = pd.DataFrame({
    'task':  datg['task'].apply(' '.join),
    'start': datg['start'].min(),
    'end':   datg['end'].max(),
    })

Upvotes: 4

Related Questions