santoku
santoku

Reputation: 3447

calculate difference in value between rows across group in python

I have a data frame like this

    time    text
0   1   
1   2   r
2   4   e
3   6   d
4   7   
5   8   b
6   9   a
7   12  g
8   15  


import pandas as pd
import numpy as np
sample = pd.DataFrame({'time':[1,2,4,6,7,8,9,12,15],'text':[' ','r','e','d',' ','b','a','g','']})

And I want to concatenate the rows such that the difference in time between each space are captured, for 'red' it would be 7-1 (6) and for bag it would be 15-7 (8) with the final result look like this

joined_text time_difference
red 6
bag 8

After joining the text with groupby I couldn't seem to get the time difference across two groups

sample.loc[:,'group_id']=(sample['text']==' ').cumsum()
sample.loc[:,'joined_text'] = sample.groupby(['group_id'])['text'].transform(lambda x: ''.join(x))

Upvotes: 0

Views: 108

Answers (2)

Rob Raymond
Rob Raymond

Reputation: 31236

  • create a column used for grouping rows - breaks every time a space is in text column
  • groupby() the above derived column
  • use a lambda function to generate the text you want

This does not match your result as you have used row 4 in boy red and bag

sample = pd.DataFrame({'time':[1,2,4,6,7,8,9,12,15],'text':[' ','r','e','d',' ','b','a','g','']})
df = sample

dfg = (df.assign(grp=np.where(df.text.shift().eq(" "), df.index, np.nan))
 .assign(grp=lambda dfa: dfa.grp.fillna(method="ffill").fillna(method="bfill"))
)

dfg

time text grp
0 1 1
1 2 r 1
2 4 e 1
3 6 d 1
4 7 1
5 8 b 5
6 9 a 5
7 12 g 5
8 15 5
dfr = (dfg.groupby("grp").agg(lambda x: "".join(x.text) + " " + str(list(x.time)[-1]-list(x.time)[0]))
 .reset_index(drop=True).drop(columns="time")
)

dfr

text
0 red 6
1 bag 7

Upvotes: 2

Mayank Porwal
Mayank Porwal

Reputation: 34086

You can do this:

In [845]: sample['new'] = sample.text.replace('\s+', np.nan, regex=True).replace('', np.nan)

In [853]: g = sample.new.isna().cumsum()
In [859]: ans = sample.groupby(g[g > 0])['text'].apply(''.join).reset_index(drop=True).to_frame('joined_text')

In [860]: ix = sample[sample.new.isna()].index

In [869]: tdiff = [sample.loc[ix[c + 1], 'time'] - sample.loc[ix[c], 'time'] for c,i in enumerate(ix) if c+1 < len(ix)]

In [881]: res = pd.concat([ans, pd.DataFrame({'time_difference': tdiff})], 1).dropna()

In [882]: res
Out[882]: 
  joined_text  time_difference
0         red              6.0
1         bag              8.0

Upvotes: 1

Related Questions