Reputation: 510
I have a DF that has the results of a NER classifier such as the following:
df =
s token pred tokenID
17 hakawati B-Loc 3
17 theatre L-Loc 3
17 jerusalem U-Loc 7
56 university B-Org 5
56 of I-Org 5
56 texas I-Org 5
56 here L-Org 6
...
5402 dwight B-Peop 1
5402 d. I-Peop 1
5402 eisenhower L-Peop 1
There are many other columns in this DataFrame that are not relevant. Now I want to group the tokens depending on their sentenceID (=s) and their predicted tags to combine them into a single entity:
df2 =
s token pred
17 hakawati theatre Location
17 jerusalem Location
56 university of texas here Organisation
...
5402 dwight d. eisenhower People
Normally I would do so by simply using a line like
data_map = df.groupby(["s"],as_index=False, sort=False).agg(" ".join)
and using a rename function. However since the data contains different kind of Strings (B,I,L - Loc/Org ..) I don't know how to exactly do it.
Any ideas are appreciated.
Any ideas?
Upvotes: 1
Views: 67
Reputation: 164623
One solution via a helper column.
df['pred_cat'] = df['pred'].str.split('-').str[-1]
res = df.groupby(['s', 'pred_cat'])['token']\
.apply(' '.join).reset_index()
print(res)
s pred_cat token
0 17 Loc hakawati theatre jerusalem
1 56 Org university of texas here
2 5402 Peop dwight d. eisenhower
Note this doesn't match exactly your desired output; there seems to be some data-specific treatment involved.
Upvotes: 1
Reputation: 2117
You could group by both s
and tokenID
and aggregate like so:
def aggregate(df):
token = " ".join(df.token)
pred = df.iloc[0].pred.split("-", 1)[1]
return pd.Series({"token": token, "pred": pred})
df.groupby(["s", "tokenID"]).apply(aggregate)
# Output
token pred
s tokenID
17 3 hakawati theatre Loc
7 jerusalem Loc
56 5 university of texas Org
6 here Org
5402 1 dwight d. eisenhower Peop
Upvotes: 0