Reputation: 45
My goal is to take a dataframe composed of words and tags and collapse it into a dataframe composed of sentences, each paired with a list of tags.
Sample input:
import pandas as pd

df = pd.DataFrame([('Effect', 'O'),
                   ('of', 'O'),
                   ('ginseng', 'i'),
                   ('extract', 'i'),
                   ('supplementation', 'i'),
                   ('on', 'O'),
                   ('testicular', 'o'),
                   ('functions', 'o'),
                   ('in', 'O'),
                   ('diabetic', 'p'),
                   ('rats', 'p'),
                   ('.', 'p'),
                   ('OBJECTIVE', 'O'),
                   ('It', 'O'),
                   ('was', 'O')],
                  columns=('token', 'annotation'))
Goal output:
df = pd.DataFrame([('Effect of ginseng extract supplementation on testicular functions in diabetic rats.',
                    ['O', 'O', 'i', 'i', 'i', 'O', 'o', 'o', 'O', 'p', 'p', 'p']),
                   ('OBJECTIVE It was', ['O', 'O', 'O'])],
                  columns=('token', 'annotation'))
Sorry for the goofy example - that really is the first 15 rows of this dataset!!
Any ideas of how to compress the rows of words into rows of sentences would be much appreciated.
Upvotes: 1
Views: 36
Reputation: 30920
Use GroupBy.agg:
new_df = (df.groupby(df['token'].eq('.').shift(fill_value=False).cumsum(),
                     as_index=False)
            .agg({'token': ' '.join, 'annotation': list}))
print(new_df)
                                               token  \
0  Effect of ginseng extract supplementation on t...
1                                   OBJECTIVE It was

                              annotation
0  [O, O, i, i, i, O, o, o, O, p, p, p]
1                              [O, O, O]
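To see what that grouper is doing, it helps to print the key on its own (using the sample df from the question). The shifted cumulative sum of the '.' mask gives every token of a sentence, including its closing period, the same integer id:

# one id per sentence: True appears at the token *after* each '.',
# and cumsum turns those Trues into an increasing integer key
key = df['token'].eq('.').shift(fill_value=False).cumsum()
print(key.tolist())
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]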
If you don't want to include the trailing period ('.') in each sentence:
m = df['token'].eq('.')

new_df = (df.groupby(m.shift(fill_value=False).cumsum().loc[~m], as_index=False)
            .agg({'token': ' '.join, 'annotation': list}))
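For reference, with the sample df this variant should drop the '.' rows entirely, so the first sentence also loses its final 'p' tag. A quick sanity check of what I'd expect (not verbatim pandas output):

print(new_df['token'].tolist())
# ['Effect of ginseng extract supplementation on testicular functions in diabetic rats',
#  'OBJECTIVE It was']
print(new_df['annotation'].tolist())
# [['O', 'O', 'i', 'i', 'i', 'O', 'o', 'o', 'O', 'p', 'p'], ['O', 'O', 'O']]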
Upvotes: 1