Reputation: 1727
I have a .txt file that is, theoretically, in CoNLL format, like this:
a O
nivel B-INDC
de O
la O
columna B-ANAT
anterior I-ANAT
del I-ANAT
acetabulo I-ANAT
existiendo O
minimos B-INDC
cambios B-INDC
edematosos B-DISO
en O
la O
medular B-ANAT
(...)
I need to convert it into a list of sentences, but I can't find a way to do it. I tried the parser from the conllu library:
from conllu import parse
sentences = parse("location/train_data.txt")
but it raises this error: ParseException: Invalid line format, line must contain either tabs or two spaces.
How can I get this output?
["a nivel de la columna anterior del acetabulo", "existiendo minimos cambios edematosos en la medular", ...]
Thanks
Upvotes: 1
Views: 2282
Reputation: 354
The simplest approach is to iterate over the lines of your file and retrieve the first column. No imports required.
result = [[]]
with open(YOUR_FILE, "r") as f:                  # "input" shadows a builtin, renamed
    for l in f:
        if not l.startswith("#"):                # skip CoNLL comment lines
            if l.strip() == "":                  # blank line ends a sentence
                if len(result[-1]) > 0:
                    result.append([])
            else:
                result[-1].append(l.split()[0])  # keep only the first column (the token)
result = [" ".join(row) for row in result if row]  # drop a possible trailing empty list
In my experience, writing these by hand is the most effective way, because CoNLL formats are terribly diverse (though usually in trivial ways, such as the order of columns) and you don't want to bother with other people's code for anything that can be solved this simply. The code quoted by @markusodenthal will, for example, keep CoNLL comments (lines starting with #), which may not be what you want.
The other thing is that writing the loop yourself lets you process sentence by sentence rather than first reading everything into an array. If you don't need to process everything en bloc, this will be both faster and more scalable.
Upvotes: 1
Reputation: 845
You can use the conllu library.
Install it with pip install conllu.
A sample use-case is shown below.
>>> from conllu import parse
>>>
>>> data = """
# text = The quick brown fox jumps over the lazy dog.
1 The the DET DT Definite=Def|PronType=Art 4 det _ _
2 quick quick ADJ JJ Degree=Pos 4 amod _ _
3 brown brown ADJ JJ Degree=Pos 4 amod _ _
4 fox fox NOUN NN Number=Sing 5 nsubj _ _
5 jumps jump VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root _ _
6 over over ADP IN _ 9 case _ _
7 the the DET DT Definite=Def|PronType=Art 9 det _ _
8 lazy lazy ADJ JJ Degree=Pos 9 amod _ _
9 dog dog NOUN NN Number=Sing 5 nmod _ SpaceAfter=No
10 . . PUNCT . _ 5 punct _ _
"""
>>> sentences = parse(data)
>>> sentences
[TokenList<The, quick, brown, fox, jumps, over, the, lazy, dog, .>]
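Each parsed sentence is a TokenList whose items behave like dicts, so the plain sentence strings the question asks for can be recovered by joining the "form" field. A minimal sketch (using a plain list of dicts to stand in for a parsed TokenList; note that conllu requires columns separated by tabs or at least two spaces, which is exactly the ParseException the question hit):

```python
def to_text(sentence):
    """Join each token's surface form back into a plain sentence string."""
    return " ".join(token["form"] for token in sentence)

# Stand-in for one parsed sentence (a list of dict-like tokens).
sentence = [{"form": "The"}, {"form": "quick"}, {"form": "fox"}]
print(to_text(sentence))  # The quick fox
```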
Upvotes: 0
Reputation: 1334
For NLP problems, the first place I look is always Hugging Face :D There is a nice example for your problem: https://huggingface.co/transformers/custom_datasets.html
There they show a function that does exactly what you want:
from pathlib import Path
import re

def read_wnut(file_path):
    file_path = Path(file_path)
    raw_text = file_path.read_text().strip()
    raw_docs = re.split(r'\n\t?\n', raw_text)    # blank lines separate sentences
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split('\t')        # WNUT files are tab-separated
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)
    return token_docs, tag_docs

texts, tags = read_wnut("location/train_data.txt")
Upvotes: 2