Kshitij

Reputation: 144

Using UIMA, Stanford Core NLP together

Both UIMA and Stanford CoreNLP produce their output by running a pipeline of operations: if we want to do POS tagging, the input text is first tokenized and then the POS tagger is run.

I want to use UIMA's tokenization and feed those tokens into the Stanford CoreNLP POS tagger. However, the Stanford CoreNLP POS tagger requires its own tokenizer to have run before it.

So, is it possible to combine the two different APIs in one pipeline? Is it possible to use the UIMA tokenizer and the Stanford CoreNLP POS tagger together?

Upvotes: 2

Views: 2236

Answers (2)

rec

Reputation: 10895

The typical approach to combining analysis steps from different tool chains (e.g. OpenNLP, Stanford CoreNLP, etc.) in UIMA is to wrap each of them as a UIMA analysis engine. The analysis engine serves as an adapter between the UIMA data structure (the CAS) and the data structures used by the individual tools (e.g. the OpenNLP POS tagger or the CoreNLP parser). At the level of UIMA, these components can then be combined into pipelines.

There are various collections of UIMA components that wrap such tool chains, e.g. ClearTK, DKPro Core, or U-Compare.

The following example combines the OpenNLP segmenter (tokenizer/sentence splitter) with the Stanford CoreNLP parser (which also creates the POS tags in this example). The example is implemented as a Groovy script that employs the uimaFIT API to create and run a pipeline using components from the DKPro Core collection.

#!/usr/bin/env groovy
@Grab(group='de.tudarmstadt.ukp.dkpro.core', 
      module='de.tudarmstadt.ukp.dkpro.core.opennlp-asl', 
      version='1.5.0')
@Grab(group='de.tudarmstadt.ukp.dkpro.core', 
      module='de.tudarmstadt.ukp.dkpro.core.stanfordnlp-gpl', 
      version='1.5.0')

import static org.apache.uima.fit.pipeline.SimplePipeline.*;
import static org.apache.uima.fit.util.JCasUtil.*;
import static org.apache.uima.fit.factory.AnalysisEngineFactory.*;
import org.apache.uima.fit.factory.JCasFactory;

import de.tudarmstadt.ukp.dkpro.core.opennlp.*;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.*;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.*;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.*;

def jcas = JCasFactory.createJCas();
jcas.documentText = "This is a test";
jcas.documentLanguage = "en";

runPipeline(jcas,
  createEngineDescription(OpenNlpSegmenter),
  createEngineDescription(StanfordParser,
    StanfordParser.PARAM_WRITE_PENN_TREE, true));

select(jcas, Token).each { println "${it.coveredText} ${it.pos.posValue}" }

select(jcas, PennTree).each { println it.pennTree }

Its output (after a lot of logging output) should look like this:

This DT
is VBZ
a DT
test NN
(ROOT
  (S
    (NP (DT This))
    (VP (VBZ is)
      (NP (DT a) (NN test)))))

I gave the Groovy script as an example because it works out of the box. A Java program would look quite similar (a rough sketch follows below), but would typically use e.g. Maven or Ivy to obtain the required libraries.

In case you want to try the script and need more information on installing Groovy and on potential troubleshooting, you can find it here.
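For comparison, a rough Java equivalent of the Groovy script could look like the sketch below. The DKPro Core and uimaFIT class, type, and parameter names are the same ones used in the Groovy example; the class name PipelineExample and the surrounding boilerplate are just placeholders, and the two DKPro Core artifacts from the @Grab lines would go into your Maven or Ivy configuration instead.

// Sketch of a Java equivalent of the Groovy pipeline above.
// Assumes the DKPro Core 1.5.0 opennlp-asl and stanfordnlp-gpl artifacts are on the classpath.
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.pipeline.SimplePipeline.runPipeline;
import static org.apache.uima.fit.util.JCasUtil.select;

import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.jcas.JCas;

import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.PennTree;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordParser;

public class PipelineExample {
    public static void main(String[] args) throws Exception {
        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("This is a test");
        jcas.setDocumentLanguage("en");

        // OpenNLP does the segmentation, the Stanford parser adds POS tags and a parse tree
        runPipeline(jcas,
                createEngineDescription(OpenNlpSegmenter.class),
                createEngineDescription(StanfordParser.class,
                        StanfordParser.PARAM_WRITE_PENN_TREE, true));

        for (Token t : select(jcas, Token.class)) {
            System.out.println(t.getCoveredText() + " " + t.getPos().getPosValue());
        }
        for (PennTree tree : select(jcas, PennTree.class)) {
            System.out.println(tree.getPennTree());
        }
    }
}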

Disclosure: I am working on the DKPro Core and Apache UIMA uimaFIT projects.

Upvotes: 5

niloc

Reputation: 41

There are at least two ways to handle this if you want to use CoreNLP as the pipeline.

  1. Force CoreNLP to ignore the requirements.

    Properties props = new Properties();
    props.put("enforceRequirements", "false");
    props.put("annotators", "pos");
    

    This will get rid of the "missing requirements" error. However, the POSTaggerAnnotator in CoreNLP expects the tokens to be CoreLabel objects and the sentences to be CoreMap objects (instantiated as ArrayCoreMap), so you will have to convert your UIMA tokens accordingly (see the sketch after this list).

  2. Add custom annotators to the pipeline.

    The CoreMaps/CoreLabels are maps with classes as keys, so you'll need a class/key for your custom annotation:

    public class CustomAnnotations {        
    
        //this class will act as a key
        public static class UIMATokensAnnotation 
                implements CoreAnnotation<List<CoreLabel>> {        
    
            //getType() defines/restricts the Type of the value associated with this key
            public Class<List<CoreLabel>> getType() {
                return ErasureUtils.<Class<List<CoreLabel>>> uncheckedCast(List.class);
            }
        }  
    }
    

    You will also need an annotator class:

    public class UIMATokensAnnotator implements Annotator {
    
        //this constructor signature is expected by StanfordCoreNLP.class
        public UIMATokensAnnotator(String name, Properties props) {
            //initialize whatever you need
        }
    
        @Override
        public void annotate(Annotation annotation) {
            // run the UIMA tokenization here and convert its output to CoreLabels
            List<CoreLabel> tokens = null; // placeholder for the converted tokens
            annotation.set(CustomAnnotations.UIMATokensAnnotation.class, tokens);
        }
    
        @Override
        public Set<Requirement> requirementsSatisfied() {
            return Collections.singleton(TOKENIZE_REQUIREMENT);
        }
    
        @Override
        public Set<Requirement> requires() {
            return Collections.emptySet();
        }
    
    }
    

    Finally, register the custom annotator and add it to the annotators list:

    props.put("customAnnotatorClass.UIMAtokenize", "UIMATokensAnnotator")
    props.put("annotators", "UIMAtokenize, ssplit, pos")
    

    The UIMA/OpenNLP/etc. sentence annotation can be added as a custom annotator in a similar fashion. Check out http://nlp.stanford.edu/software/corenlp-faq.shtml#custom for a condensed version of option #2.
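To make option 1 more concrete, here is a minimal sketch (not part of the original answer) of how externally produced tokens could be wrapped as CoreLabels and a single ArrayCoreMap sentence before running the pos annotator with enforceRequirements=false. The hard-coded token offsets merely stand in for whatever the UIMA tokenizer would actually produce.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.util.ArrayCoreMap;
    import edu.stanford.nlp.util.CoreMap;

    public class PosOnExternalTokens {
        public static void main(String[] args) {
            String text = "This is a test";

            // Only the POS tagger; enforceRequirements=false suppresses the
            // "annotator pos requires tokenize" check.
            Properties props = new Properties();
            props.put("enforceRequirements", "false");
            props.put("annotators", "pos");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            // Build CoreLabels from externally produced tokens. The offsets are
            // hard-coded here; in practice they would come from the UIMA tokenizer.
            CoreLabelTokenFactory factory = new CoreLabelTokenFactory();
            List<CoreLabel> tokens = new ArrayList<CoreLabel>();
            int[][] offsets = { { 0, 4 }, { 5, 7 }, { 8, 9 }, { 10, 14 } };
            for (int[] o : offsets) {
                tokens.add(factory.makeToken(text.substring(o[0], o[1]), o[0], o[1] - o[0]));
            }

            // The POS tagger works on sentences, so wrap the tokens in one CoreMap sentence.
            ArrayCoreMap sentence = new ArrayCoreMap();
            sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
            List<CoreMap> sentences = new ArrayList<CoreMap>();
            sentences.add(sentence);

            Annotation annotation = new Annotation(text);
            annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);
            annotation.set(CoreAnnotations.SentencesAnnotation.class, sentences);

            pipeline.annotate(annotation);

            for (CoreLabel token : tokens) {
                System.out.println(token.word() + " "
                        + token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
            }
        }
    }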

Upvotes: 4
