Reputation: 1727
I have a .txt file that is, theoretically, in CoNLL format, like this:
a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT
existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)
I need to convert it into a list of sentences, but I can't find a way to do it. I tried the parser from the conllu library:
from conllu import parse
sentences = parse("location/train_data.txt")
but it raises this error: ParseException: Invalid line format, line must contain either tabs or two spaces.
How can I get this output?
["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]
Thanks
Upvotes: 1
Views: 2282
Reputation: 354
The simplest approach is to iterate over the lines of your file and retrieve the first column. No imports required.
result = [[]]
with open(YOUR_FILE, "r") as f:                  # "input" shadows a builtin, renamed
    for l in f:
        if not l.startswith("#"):                # skip CoNLL comment lines
            if l.strip() == "":                  # blank line ends a sentence
                if len(result[-1]) > 0:
                    result.append([])
            else:
                result[-1].append(l.split()[0])  # keep only the first column (the token)
result = [" ".join(row) for row in result if row]  # drop a possible trailing empty list
In my experience, writing these by hand is the most effective way, because CoNLL formats are terribly diverse (though usually in trivial ways, such as the order of columns) and you don't want to bother with other people's code for anything that can be solved this simply. The code quoted by @markusodenthal will, for example, keep CoNLL comments (lines starting with #), which may not be what you want.
The other thing is that writing the loop yourself lets you process sentence by sentence rather than first reading everything into an array. If you don't need to process everything en bloc, this will be both faster and more scalable.
Upvotes: 1
Reputation: 845
You can use the conllu library.
Install it with pip install conllu.
A sample use-case is shown below.
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
"""
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
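Each parsed sentence is a TokenList whose items behave like dicts, so the plain sentence strings the question asks for can be recovered by joining the "form" field. A minimal sketch (using a plain list of dicts to stand in for a parsed TokenList; note that conllu requires columns separated by tabs or at least two spaces, which is exactly the ParseException the question hit):

```python
def to_text(sentence):
    """Join each token's surface form back into a plain sentence string."""
    return " ".join(token["form"] for token in sentence)

# Stand-in for one parsed sentence (a list of dict-like tokens).
sentence = [{"form": "The"}, {"form": "quick"}, {"form": "fox"}]
print(to_text(sentence))  # The quick fox
```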
Upvotes: 0
Reputation: 1334
For NLP problems, the first place I look is always Hugging Face :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html
There they show a function that does exactly what you want:
from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)
    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)    # blank lines separate sentences
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split('\t')        # WNUT files are tab-separated
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)
    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
Upvotes: 2