Reputation: 252
I have extracted a pdf into a dataframe and would like to merge the rows if column B is the same speaker :
From :
Index Column B Column C
1 'I am going' Speaker A
2 'to the zoo' Speaker A
3 'I am going' Speaker B
4 'home ' Speaker B
5 'I am going' Speaker A
6 'to the park' Speaker A
To :
Index Column B Column C
1 'I am going to the zoo ' Speaker A
2 'I am going home' Speaker B
3 'I am going to the park' Speaker A
I tried using groupby but the order is important in the context of a pdf which is a speech.
Upvotes: 1
Views: 495
Reputation: 164623
You can use GroupBy
+ agg
after creating a series identifying when Column C changes:
res = df.assign(key=df['Column C'].ne(df['Column C'].shift()).cumsum())\
.groupby('key').agg({'Column C': 'first', 'Column B': ' '.join})\
.reset_index()
print(res)
key Column C Column B
0 1 Speaker A 'I am going' 'to the zoo'
1 2 Speaker B 'I am going' 'home '
2 3 Speaker A 'I am going' 'to the park'
Note the output has quotation marks, as per the input you have supplied. These won't show if the strings are defined without quotes.
Upvotes: 3
Reputation: 1116
Use groupby
and agg
, as follows:
import pandas as pd
from functools import reduce
data = {'col1': [1,1,2,2,3], 'col2': ['foo', 'bar', 'baz', 'bag', 'bat']}
df = pd.DataFrame(data)
print(df)
aggregated = df.groupby('col1').agg(lambda x: reduce(lambda s1, s2: s1 + s2, x))
print(aggregated)
Will produce the following output:
col1 col2
0 1 foo
1 1 bar
2 2 baz
3 2 bag
4 3 bat
col2
col1
1 foobar
2 bazbag
3 bat
Upvotes: 0