Tokenize entities in dataframe

Question

I'm looking for a way to tokenize the entities in my dataframe while keeping the Tag associated with each one.

The tokenization applies to elisions, for example:

["d'Angers"] => ["d'", "Angers"]
["l'impératrice"] => ["l'", "impératrice"]
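
As a rough illustration of the elision rule on plain strings, here is a minimal sketch using the standard library. The helper name `split_elision` is hypothetical, and the sketch assumes mentions use the straight apostrophe (') rather than a typographic one.

```python
import re

def split_elision(mention):
    # Split right after each straight apostrophe; the apostrophe stays
    # attached to the left-hand piece. Empty pieces are dropped.
    return [piece for piece in re.split(r"(?<=')", mention) if piece]

print(split_elision("d'Angers"))
print(split_elision("l'impératrice"))
```

A mention without an apostrophe comes back as a single-element list, so the same helper can be applied to every row.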

Input dataframe :

Sentence  Mention        Tag
3         Vincennes      B-LOCATION
3         .              O

4         Confirmation   O
4         des            O
4         privilèges     O
4         de             O
4         la             O
4         ville          O
4         d'Aire         O
4         1              O
4         ,              O
4         au             O
4         bailliage      B-ORGANISATION
4         d'Amiens       I-ORGANISATION
4         .

5         Projet         O
5         de             O
5         "              O
5         tour           O
5         de             O
5         l'impératrice  B-TITLE
5         Eugénie        B-PERSON
5         .

6         session
6         à              O
6         l'ONU          B-ORGANISATION
6         du
6         17
6         mai
6         .

Expected output :

Sentence  Mention        Tag
3         Vincennes      B-LOCATION
3         .              O

4         Confirmation   O
4         des            O
4         privilèges     O
4         de             O
4         la             O
4         ville          O
4         d'Aire         O
4         1              O
4         ,              O
4         au             O
4         bailliage      B-ORGANISATION
4         d'             I-ORGANISATION
4         Amiens         I-ORGANISATION
4         .

5         Projet         O
5         de             O
5         "              O
5         tour           O
5         de             O
5         l'             O
5         impératrice    B-TITLE
5         Eugénie        B-PERSON
5         .

6         session
6         à              O
6         l'             O
6         ONU            B-ORGANISATION
6         du
6         17
6         mai
6         .

The difficulty is keeping the label associated with each tokenized mention. If anyone has any leads, thank you in advance.

mozway · Accepted Answer

You could split using a lookbehind regex and explode:

# Split each mention just after an apostrophe, then expand the pieces into rows
(df.assign(Mention=df['Mention'].str.split(r"(?<=')"))
   .explode('Mention')
)

Output (each piece keeps the tag and the index of the row it came from):

    Sentence       Mention             Tag
0          3     Vincennes      B-LOCATION
1          3             .               O
2          4  Confirmation               O
3          4           des               O
4          4    privilèges               O
5          4            de               O
6          4            la               O
7          4         ville               O
8          4            d'               O
8          4          Aire               O
9          4             1               O
10         4             ,               O
11         4            au               O
12         4     bailliage  B-ORGANISATION
13         4            d'  I-ORGANISATION
13         4        Amiens  I-ORGANISATION
14         4             .            None
15         5        Projet               O
16         5            de               O
17         5             "               O
18         5          tour               O
19         5            de               O
20         5            l'         B-TITLE
20         5   impératrice         B-TITLE
21         5       Eugénie        B-PERSON
22         5             .            None
23         6       session            None
24         6             à               O
25         6            l'  B-ORGANISATION
25         6           ONU  B-ORGANISATION
26         6            du            None
27         6            17            None
28         6           mai            None
29         6             .            None
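
Note that in the output above every piece inherits its row's tag (for example, l' is tagged B-TITLE) and d'Aire is split too, whereas the expected output in the question tags l' as O and leaves d'Aire whole. If that expected output is the target, one possible reading of it is: split only rows tagged B-... or I-..., and give the elided piece of a B-... row the tag O. The sketch below follows that reading; the rule itself and the variable names are assumptions, not something stated in the question.

```python
import re
import pandas as pd

# Small sample taken from the question's input table
df = pd.DataFrame({
    "Sentence": [4, 4, 4, 4, 5, 5, 6],
    "Mention": ["d'Aire", "bailliage", "d'Amiens", ".",
                "l'impératrice", "Eugénie", "l'ONU"],
    "Tag": ["O", "B-ORGANISATION", "I-ORGANISATION", None,
            "B-TITLE", "B-PERSON", "B-ORGANISATION"],
})

# Assumption: only rows carrying an entity tag (B-... or I-...) are split
is_entity = df["Tag"].str.match(r"[BI]-", na=False)
pieces = [
    [p for p in re.split(r"(?<=')", mention) if p] if entity else [mention]
    for mention, entity in zip(df["Mention"], is_entity)
]
out = df.assign(Mention=pieces).explode("Mention")

# After explode, pieces of one original row share its index label;
# every piece except the last one is an elided prefix such as "l'"
is_prefix = out.index.duplicated(keep="last")
out = out.reset_index(drop=True)

# Assumption: the prefix of a B-... mention becomes "O", so the B- tag
# stays on the word that follows; I-... prefixes keep their tag
begins = out["Tag"].str.startswith("B-", na=False).to_numpy()
out.loc[is_prefix & begins, "Tag"] = "O"

print(out)
```

On this sample, d'Aire stays whole, d' and Amiens both carry I-ORGANISATION, and l' is tagged O ahead of impératrice and ONU.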
