Reputation: 3447
I have a data frame like this
time text
0 1
1 2 r
2 4 e
3 6 d
4 7
5 8 b
6 9 a
7 12 g
8 15
import pandas as pd
import numpy as np
sample = pd.DataFrame({'time':[1,2,4,6,7,8,9,12,15],'text':[' ','r','e','d',' ','b','a','g','']})
And I want to concatenate the rows such that the difference in time between each space are captured, for 'red' it would be 7-1 (6) and for bag it would be 15-7 (8) with the final result look like this
joined_text time_difference
red 6
bag 8
After joining the text with groupby I couldn't seem to get the time difference across two groups
sample.loc[:,'group_id']=(sample['text']==' ').cumsum()
sample.loc[:,'joined_text'] = sample.groupby(['group_id'])['text'].transform(lambda x: ''.join(x))
Upvotes: 0
Views: 108
Reputation: 31236
groupby()
the above derived columnlambda
function to generate the text you wantThis does not match your result as you have used row 4 in boy red and bag
sample = pd.DataFrame({'time':[1,2,4,6,7,8,9,12,15],'text':[' ','r','e','d',' ','b','a','g','']})
df = sample
dfg = (df.assign(grp=np.where(df.text.shift().eq(" "), df.index, np.nan))
.assign(grp=lambda dfa: dfa.grp.fillna(method="ffill").fillna(method="bfill"))
)
time | text | grp | |
---|---|---|---|
0 | 1 | 1 | |
1 | 2 | r | 1 |
2 | 4 | e | 1 |
3 | 6 | d | 1 |
4 | 7 | 1 | |
5 | 8 | b | 5 |
6 | 9 | a | 5 |
7 | 12 | g | 5 |
8 | 15 | 5 |
dfr = (dfg.groupby("grp").agg(lambda x: "".join(x.text) + " " + str(list(x.time)[-1]-list(x.time)[0]))
.reset_index(drop=True).drop(columns="time")
)
text | |
---|---|
0 | red 6 |
1 | bag 7 |
Upvotes: 2
Reputation: 34086
You can do this:
In [845]: sample['new'] = sample.text.replace('\s+', np.nan, regex=True).replace('', np.nan)
In [853]: g = sample.new.isna().cumsum()
In [859]: ans = sample.groupby(g[g > 0])['text'].apply(''.join).reset_index(drop=True).to_frame('joined_text')
In [860]: ix = sample[sample.new.isna()].index
In [869]: tdiff = [sample.loc[ix[c + 1], 'time'] - sample.loc[ix[c], 'time'] for c,i in enumerate(ix) if c+1 < len(ix)]
In [881]: res = pd.concat([ans, pd.DataFrame({'time_difference': tdiff})], 1).dropna()
In [882]: res
Out[882]:
joined_text time_difference
0 red 6.0
1 bag 8.0
Upvotes: 1