What is the NLTK equivalent of the UIMA CAS (Common Analysis Structure)?

Question

In UIMA, the CAS (Common Analysis Structure) plays a major role in structuring an NLP application. It lets one component pass the metadata it adds on to the next component. For example, sentence boundaries from a sentence tokenizer can be added to the CAS and used by the subsequent word tokenizer.

What is the equivalent data structure in NLTK?

zepp133 · Accepted Answer

In short, NLTK has no equivalent of the CAS (Common Analysis System). NLTK represents texts much more simply than UIMA does. In NLTK, texts are simply lists of words, whereas UIMA defines very complex (and heavyweight) data structures as part of the CAS to describe the input data and its flow through a UIMA system.
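To make the difference concrete, here is a minimal sketch in plain Python of what a CAS-like container does: it keeps the raw text once and stores each component's output as offset-based annotations that later components can read. This is neither UIMA's API nor anything NLTK ships. The names `Annotation`, `Document`, `annotate_sentences` and `annotate_tokens` are hypothetical, and the regex-based splitters are deliberately naive stand-ins for real tokenizers.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Sentence" or "Token" (hypothetical type names)
    begin: int  # character offset into the text, inclusive
    end: int    # character offset into the text, exclusive


@dataclass
class Document:
    """Hypothetical CAS-like container: raw text plus standoff annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        # Return a new list of all annotations of the given kind.
        return [a for a in self.annotations if a.kind == kind]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def annotate_sentences(doc):
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    for m in re.finditer(r"\S[^.!?]*[.!?]", doc.text):
        doc.annotations.append(Annotation("Sentence", m.start(), m.end()))


def annotate_tokens(doc):
    # Reads the sentence boundaries left by the previous component.
    for sent in doc.select("Sentence"):
        for m in re.finditer(r"\w+|[^\w\s]", doc.covered_text(sent)):
            doc.annotations.append(
                Annotation("Token", sent.begin + m.start(), sent.begin + m.end())
            )


doc = Document("Hello world. How are you?")
annotate_sentences(doc)
annotate_tokens(doc)

# The NLTK-style view of the same result is just a flat list of strings.
words = [doc.covered_text(t) for t in doc.select("Token")]
print(words)
```

In NLTK you would normally skip the container entirely and pass the list of sentence strings from one tokenizer straight into the next. If you do need offsets, some NLTK tokenizers expose a `span_tokenize` method that returns character spans, but collecting them into a shared structure like the one above is left to you.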

That being said, I see the two as serving quite different purposes anyway. If I were to name a Java equivalent of NLTK, I would choose the OpenNLP toolkit rather than UIMA. The former offers a number of machine-learning-based algorithms for NLP (as does NLTK, among other things), while the latter is a component-based framework not only for NLP but for unstructured data in general. That is, it defines a general model for building applications that work with unstructured data.
