NLP text annotation storage and access

Question

I have a large corpus of text (10 million sentences or so) which I'd like to preprocess with various NLP tools (POS tagger, syntactic parser, dependency parser, etc.). I need to store the annotation layers these tools create and access them on the fly from my Java code, perhaps by providing the start and end index of a text span in the corpus along with the annotation type.

Does a software system already exist to store and access these annotations quickly? If not, what would be the best way to store and access these annotations? Speed of access would be most important.

Himanshu Gahlot · Accepted Answer

You can look at UIMA. It is not a storage engine, but it provides a platform for converting unstructured text into a more structured form by applying annotators (tokenizers, sentence splitters, POS taggers, and so on) in a pipeline. The output contains the annotations with their start and end indices in the document, and you can convert that output to XML. So you can divide your corpus into several documents, pass them through UIMA pipelines, and store the output in a document-based store such as MongoDB. I think accessing these annotations at the document level makes more sense, because the context of an annotation matters. You can then retrieve the annotated documents from MongoDB and look up annotations by start and end index or by annotation type (token, sentence, etc.).
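To make the lookup step concrete, here is a minimal plain-Java sketch of what you might do once an annotated document has been loaded from the store. It does not use the UIMA or MongoDB APIs; the class names `SpanAnnotation` and `DocumentAnnotations` and the method `covered` are hypothetical, made up for illustration, and the character-offset convention (begin inclusive, end exclusive) is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One stand-off annotation: a typed label over the character span [begin, end). */
final class SpanAnnotation {
    final String type;   // illustrative type names, e.g. "Token", "Sentence", "POS"
    final int begin;     // start offset in the document, inclusive
    final int end;       // end offset in the document, exclusive
    final String value;  // covered text or tag, depending on the layer

    SpanAnnotation(String type, int begin, int end, String value) {
        this.type = type;
        this.begin = begin;
        this.end = end;
        this.value = value;
    }
}

/** Hypothetical in-memory view of all annotation layers for a single document. */
final class DocumentAnnotations {
    // One list per annotation type, so a query touches only the layer it asks for.
    private final Map<String, List<SpanAnnotation>> byType = new HashMap<>();

    void add(SpanAnnotation a) {
        byType.computeIfAbsent(a.type, k -> new ArrayList<>()).add(a);
    }

    /** Returns annotations of the given type lying entirely within [begin, end). */
    List<SpanAnnotation> covered(String type, int begin, int end) {
        List<SpanAnnotation> result = new ArrayList<>();
        for (SpanAnnotation a : byType.getOrDefault(type, Collections.emptyList())) {
            if (a.begin >= begin && a.end <= end) {
                result.add(a);
            }
        }
        return result;
    }
}
```

This scans one layer linearly, which is usually fine for a single document. If documents are long, keeping each layer sorted by start offset and binary-searching for the first candidate would be a natural next step.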
