Parsing Italian CONLLU files to remove lemmas

Question

I am working with Italian Universal Dependencies data in CoNLL-U format, like this:

sent_id = VIT-4006
text = "grazie dell'informazione, la metterò nella memoria del mio Macintosh".
1 "   "   PUNCT   FB  _   2   punct   _   SpaceAfter=No
2 grazie  grazie  NOUN    S   _   0   root    _   _
3-4   dell'   _   _   _   _   _   _   _   SpaceAfter=No
3 di  di  ADP E   _   5   case    _   _
4 l'  il  DET RD  Definite=Def|Number=Sing|PronType=Art   5   det _   _
5 informazione    informazione    NOUN    S   Gender=Fem|Number=Sing  2   nmod    _   SpaceAfter=No
6 ,   ,   PUNCT   FF  _   2   punct   _   _
7 la  la  PRON    PC  Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8   obj _   _
8 metterò mettere VERB    V   Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin    2   parataxis   _   _
9-10  nella   _   _   _   _   _   _   _   _
9 in  in  ADP E   _   11  case    _   _
10    la  il  DET RD  Definite=Def|Gender=Fem|Number=Sing|PronType=Art    11  det _   _
11    memoria memoria NOUN    S   Gender=Fem|Number=Sing  8   obl _   _
12-13 del _   _   _   _   _   _   _   _
12    di  di  ADP E   _   15  case    _   _
13    il  il  DET RD  Definite=Def|Gender=Masc|Number=Sing|PronType=Art   15  det _   _
14    mio mio DET AP  Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs   15  det:poss    _   _
15    Macintosh   Macintosh   PROPN   SP  _   11  nmod    _   SpaceAfter=No
16    "   "   PUNCT   FB  _   2   punct   _   SpaceAfter=No
17    .   .   PUNCT   FS  _   2   punct   _   _

In this example, I want to remove the lines with IDs 3, 4, 9, 10, 12, and 13, since they are the component words of the multiword tokens on the range lines immediately above them (3-4, 9-10, 12-13).

I want my output to look like:

sent_id = VIT-4006
text = "grazie dell'informazione, la metterò nella memoria del mio Macintosh".
1 "   "   PUNCT   FB  _   2   punct   _   SpaceAfter=No
2 grazie  grazie  NOUN    S   _   0   root    _   _
3-4   dell'   _   _   _   _   _   _   _   SpaceAfter=No
5 informazione    informazione    NOUN    S   Gender=Fem|Number=Sing  2   nmod    _   SpaceAfter=No
6 ,   ,   PUNCT   FF  _   2   punct   _   _
7 la  la  PRON    PC  Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs 8   obj _   _

...

thanks

The conllu library's TokenList object includes every word (grazie, dell', di, l', informazione, ...), both the range lines and their component words, so I am not able to use it directly.
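For illustration, here is a minimal sketch of the filtering described above, using only plain string handling rather than the conllu library. It assumes the first whitespace-separated field of each token line is the ID, that a range line such as 3-4 comes directly before its component words, and that sentences are separated by blank lines. The function name drop_multiword_components and the shortened sample are hypothetical, made up for this example.

```python
def drop_multiword_components(lines):
    """Yield CoNLL-U lines, skipping words covered by a preceding range line."""
    covered = range(0)  # IDs covered by the most recent range line (empty at start)
    for line in lines:
        parts = line.split(None, 1)
        first = parts[0] if parts else ""
        start, sep, end = first.partition("-")
        if sep and start.isdecimal() and end.isdecimal():
            # Range line such as "3-4": keep it and remember which IDs it covers.
            covered = range(int(start), int(end) + 1)
            yield line
        elif first.isdecimal():
            if int(first) in covered:
                continue  # component word of the multiword token: drop it
            covered = range(0)  # past the range, stop skipping
            yield line
        else:
            # Comment, blank line (sentence boundary), or anything else: keep as is.
            covered = range(0)
            yield line


# Shortened, tab-separated sample based on the sentence in the question.
sample = """\
sent_id = VIT-4006
2\tgrazie\tgrazie\tNOUN\tS\t_\t0\troot\t_\t_
3-4\tdell'\t_\t_\t_\t_\t_\t_\t_\tSpaceAfter=No
3\tdi\tdi\tADP\tE\t_\t5\tcase\t_\t_
4\tl'\til\tDET\tRD\tDefinite=Def|Number=Sing|PronType=Art\t5\tdet\t_\t_
5\tinformazione\tinformazione\tNOUN\tS\tGender=Fem|Number=Sing\t2\tnmod\t_\tSpaceAfter=No
""".splitlines()

kept = list(drop_multiword_components(sample))
for line in kept:
    print(line)
```

Because the function accepts any iterable of lines, an open file object could be passed in directly and the result written out with writelines. Note that the surviving range lines carry no lemma, POS tag, or head, and the HEAD values on other lines still refer to the original IDs, so the output is no longer a valid dependency tree.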


Answers (0)
