Reputation: 1554
from pyconll import load_from_file
data = load_from_file("filename.conll")
data
I'm following the pyconll documentation to read in a .conll file, yet the following error occurs and I don't understand what it means. The dataset should be readable, since it is a kind of benchmark dataset, and I don't see any other parameters of pyconll.load_from_file() in the documentation that could be specified. Can anybody help me out here?
Plus, is there a way to read a .conll file with the nltk package?
ParseError Traceback (most recent call last)
<ipython-input-14-06859f7ce8b2> in <module>()
----> 1 data = load_from_file("filename.conll")
2 data
5 frames
/usr/local/lib/python3.6/dist-packages/pyconll/unit/token.py in __init__(self, source, empty)
661 error_msg = 'The number of columns per token line must be 10. Invalid token: {}'.format(
662 source)
--> 663 raise ParseError(error_msg)
664
665 # Assign all the field values from the line to internal equivalents.
ParseError: The number of columns per token line must be 10. Invalid token: @paulwalk O
Upvotes: 2
Views: 3510
Reputation: 316
I am the creator of pyconll. This answer comes long after the original question, but in general the library works only with CoNLL-U format files. The name is pyconll because it aspires to support more CoNLL versions in the future, but currently only the CoNLL-U format is supported.
It is also a bit strict about what it accepts. All files from the UD project work and are tested against, but a manually created file, or one from a different source, may not follow the spec set out by UD.
Upvotes: 0
Reputation: 804
I have encountered the same problem. I fixed it by switching the Python library to conllu (https://pypi.org/project/conllu/). Now I can read and parse all my CoNLL-U files without any problem. I think https://stackoverflow.com/a/66563362/7924573 explains the reason why.
Upvotes: 0
Reputation: 354
The problem is that "CoNLL" formats differ in the number, order and content of columns. According to the error, your parser seems to expect CoNLL-U (https://universaldependencies.org/format.html) or CoNLL-X (original website down). Whatever your input is, the error says that the parser did not find the expected number of columns (10); see What is CoNLL data format?.
However, if you do have 10 columns, try escaping the token reported in the error; maybe some internal regex replacement failed.
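To check which case applies, it may help to count the columns per line before handing the file to any parser. The following is only a rough sketch in plain Python, not part of pyconll; the function name `column_counts` and the `skip_comments` flag are made up for illustration, and apart from the `@paulwalk O` line taken from the error message, the sample lines are invented.

```python
from collections import Counter

def column_counts(lines, skip_comments=False):
    """Count how many non-blank lines have each number of columns.

    skip_comments is off by default: CoNLL-U comments start with '#',
    but in tweet-like data a token such as a hashtag may start with '#' too.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sentences
        if skip_comments and line.startswith("#"):
            continue
        # CoNLL-U fields are tab-separated; otherwise fall back to any whitespace
        fields = line.split("\t") if "\t" in line else line.split()
        counts[len(fields)] += 1
    return counts

# Invented sample; only the first line comes from the error message above
sample = ["@paulwalk O", "It O", "", "Empire B-location"]
ten_columns = "\t".join(["1", "The", "the", "DET", "DT", "_", "2", "det", "_", "_"])

print(column_counts(sample))         # every line has 2 columns
print(column_counts([ten_columns]))  # a CoNLL-U style line has 10
```

On the real file this would be called as `column_counts(open("filename.conll", encoding="utf-8"))`. If every line reports 2 columns rather than 10, the file is simply not CoNLL-U and there is nothing to escape.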
As for the sub-question on parsing CoNLL with NLTK, see the details in the question(!) Parsing CoNLL-U files with NLTK. Parsing some CoNLL formats is possible, but NLTK doesn't seem to support more recent CoNLL formats, in particular not CoNLL-X and CoNLL-U. It might work on your data (as this is neither CoNLL-X nor CoNLL-U).
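If the file really is a simple two-column "token tag" format, as the `@paulwalk O` line in the error suggests, it can probably be read without any CoNLL library at all. Here is a minimal sketch under that assumption (one token per line, the tag in the last column, blank lines between sentences); `read_token_tag_file` is a hypothetical helper, not an API of pyconll or NLTK, and the sample lines other than `@paulwalk O` are invented.

```python
def read_token_tag_file(lines):
    """Group 'token tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace so the tag is always the final field
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError("Expected 'token tag', got: {!r}".format(line))
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)  # file may not end with a blank line
    return sentences

# Invented sample; only the first line comes from the error message above
sample = ["@paulwalk O", "It O", "", "Empire B-location", "State I-location"]
sentences = read_token_tag_file(sample)
for sentence in sentences:
    print(sentence)
```

For the actual data, pass the open file object instead of `sample`. The result is a list of sentences, each a list of `(token, tag)` tuples, which is easy to convert to whatever structure a downstream tagger expects.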
Upvotes: 3