Paw in Data

Reputation: 1554

Why can't I read in a .conll file with Python (confusing parse error)?

from pyconll import load_from_file

data = load_from_file("filename.conll")
data

I'm following the pyconll documentation to read in a .conll file, but the error below occurs and I don't understand what it means. The dataset should be readable, since it is a fairly standard benchmark dataset, and I don't see any other parameters for pyconll.load_from_file() in the documentation. Can anybody help me out here?

Also, is there a way to read a .conll file with the nltk package?

ParseError                                Traceback (most recent call last)
<ipython-input-14-06859f7ce8b2> in <module>()
----> 1 data = load_from_file("filename.conll")
      2 data

5 frames
/usr/local/lib/python3.6/dist-packages/pyconll/unit/token.py in __init__(self, source, empty)
    661             error_msg = 'The number of columns per token line must be 10. Invalid token: {}'.format(
    662                 source)
--> 663             raise ParseError(error_msg)
    664 
    665         # Assign all the field values from the line to internal equivalents.

ParseError: The number of columns per token line must be 10. Invalid token: @paulwalk   O

Upvotes: 2

Views: 3510

Answers (3)

Matias Grioni

Reputation: 316

I am the creator of pyconll. This answer comes long after the original question, but in general the library works only with CoNLL-U format files. The name is pyconll because it aspires to support more CoNLL versions in the future, but currently only CoNLL-U is supported.

It is also fairly strict about what it accepts. All files from the Universal Dependencies (UD) project work and are tested against, but a manually created file, or one from a different source, may not follow the spec set out by UD.
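
As a quick check before loading, the number of columns per token line can be inspected with plain Python. This is only a sketch: column_counts is a made-up helper, not part of pyconll, and the sample lines are invented. It assumes tab-separated columns and treats lines starting with # as comments, as CoNLL-U does (so a hashtag token in a two-column file would be skipped).

```python
from collections import Counter


def column_counts(text, sep="\t"):
    """Count how many token lines have each number of columns.

    Hypothetical helper for illustration; not part of pyconll.
    """
    counts = Counter()
    for line in text.splitlines():
        # CoNLL-U separates sentences with blank lines and marks
        # comment lines with a leading '#'; neither is a token line.
        if not line.strip() or line.startswith("#"):
            continue
        counts[len(line.split(sep))] += 1
    return counts


# Invented two-column sample (token, tag), similar in shape to the
# token shown in the error message.
sample = "@paulwalk\tO\nhello\tO\n\nLondon\tB-LOC\n"
print(column_counts(sample))
```

If the result contains anything other than 10-column lines, the file is not CoNLL-U and pyconll will reject it.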

Upvotes: 0

tschomacker

Reputation: 804

I ran into the same problem and fixed it by switching the Python library to conllu (https://pypi.org/project/conllu/). Now I can read and parse all my CoNLL-U files without any problem. I think https://stackoverflow.com/a/66563362/7924573 explains the reason why.

Upvotes: 0

Chiarcos

Reputation: 354

The problem is that "CoNLL" formats differ in the number, order and content of columns. According to the error, your parser seems to expect CoNLL-U (https://universaldependencies.org/format.html) or CoNLL-X (original website down). Whatever your input is, the error says that the expected number of columns (10) was not found; see the question "What is CoNLL data format?".

However, if you do have 10 columns, try escaping the token reported in the error; maybe some internal regex replacement failed.

As for the sub-question on parsing CoNLL with NLTK, see the details in the question(!) "Parsing CoNLL-U files with NLTK". Parsing some CoNLL formats is possible, but NLTK doesn't seem to support more recent CoNLL formats, in particular not CoNLL-X and CoNLL-U. It might work on your data (as this is neither CoNLL-X nor CoNLL-U).
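
If no parser fits, a file with one token and one tag per line (which is what the token in the error message suggests) can also be read without any CoNLL library. The following is a minimal sketch under that assumption: read_token_tag_sentences is a hypothetical helper, the sample lines are invented, and it assumes whitespace-separated columns with blank lines between sentences.

```python
def read_token_tag_sentences(lines):
    """Group (token, tag) pairs into sentences.

    Hypothetical helper for illustration. Assumes the token is the
    first column, the tag is the last column, and blank lines
    separate sentences.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()  # splits on tabs or runs of spaces
        current.append((parts[0], parts[-1]))
    if current:
        # The file may not end with a blank line.
        sentences.append(current)
    return sentences


# Invented sample data in the assumed two-column layout.
sample_lines = ["@paulwalk\tO", "hello\tO", "", "London\tB-LOC"]
print(read_token_tag_sentences(sample_lines))
```

To read from disk, pass an open file handle instead of a list, since iterating over a text file yields its lines.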

Upvotes: 3
