FarHooD

Reputation: 65

Custom Model In Stanford NLP (CoreNLP) For Persian

I use Stanford NLP to extract data from text and know my way around this excellent library, but I need to build my project in Java, which I know better than Python. I found this code for using a custom model:

import stanfordnlp

config = {
    'processors': 'tokenize,mwt,pos,lemma,depparse', # Comma-separated list of processors to use
    'lang': 'fa', # Language code for the language to build the Pipeline in
    'tokenize_model_path': './PersianPT/fa_seraji_models/fa_seraji_tokenizer.pt', # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    'mwt_model_path': './PersianPT/fa_seraji_models/fa_seraji_mwt_expander.pt',
    'pos_model_path': './PersianPT/fa_seraji_models/fa_seraji_tagger.pt',
    'pos_pretrain_path': './PersianPT/fa_seraji_models/fa_seraji.pretrain.pt',
    'lemma_model_path': './PersianPT/fa_seraji_models/fa_seraji_lemmatizer.pt',
    'depparse_model_path': './PersianPT/fa_seraji_models/fa_seraji_parser.pt',
    'depparse_pretrain_path': './PersianPT/fa_seraji_models/fa_seraji.pretrain.pt'
}
nlp = stanfordnlp.Pipeline(**config) # Initialize the pipeline using a configuration dict
doc = nlp("من عاشقت هستم") # Run the pipeline on input text

with open('./Desktop/NLP/out.txt', "w", encoding="utf-8") as f:
    for sen in doc.sentences[0]._tokens :
        f.write(sen.words[0].text + '---Upos : ' +sen.words[0].upos + '---Xpos : ' +sen.words[0].xpos + '\n')
doc.sentences[0].print_tokens()

This works fine in Python, but when I implement the same thing in Java, the output is not the same, and I don't know why.

Java code:

import java.io.File;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class TextAnalyzer {

    public static String text = """
            در ابتدا، زندگی‌نامه‌ها به عنوان یک بخش از تاریخ با تمرکز بر یک فرد خاص، با اهمیت تاریخی در نظر گرفته شد. انواع مستقل زندگی‌نامه‌نویسی با تمایز از تاریخ عمومی از قرن ۱۸ ام شروع شده و فرم‌های معاصر آن به قرن بیستم می‌رسد.
            """;

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run; other combinations I tried are commented out
        //props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
        //props.setProperty("coref.algorithm", "neural");
        //props.setProperty("processors", "tokenize,mwt,pos,lemma,depparse");
        //props.setProperty("processors", "tokenize, mwt, lemma, pos, depparse");
        props.setProperty("annotators", "tokenize, ssplit, parse");

        //props.setProperty("lang", "fa");
        //props.setProperty("use_gpu", "true");

        // point the pipeline at the Seraji Persian model files
        String modelDir = BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar;
        props.setProperty("tokenize_model_path", modelDir + "fa_seraji_tokenizer.pt");
        props.setProperty("mwt_model_path", modelDir + "fa_seraji_mwt_expander.pt");
        props.setProperty("pos_model_path", modelDir + "fa_seraji_tagger.pt");
        props.setProperty("pos_pretrain_path", modelDir + "fa_seraji.pretrain.pt");
        props.setProperty("lemma_model_path", modelDir + "fa_seraji_lemmatizer.pt");
        props.setProperty("depparse_model_path", modelDir + "fa_seraji_parser.pt");
        props.setProperty("depparse_pretrain_path", modelDir + "fa_seraji.pretrain.pt");

        // build the pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // create a document object and run all annotators on this text
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // print the parse tree of every sentence
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            Tree parseTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println(parseTree);
        }
    }
}

I think it may be the processors or the model-path arguments (*_model_path) that don't exist in the Java library. If someone knows about this, please share. Thanks.

Upvotes: 1

Views: 428

Answers (1)

Christopher Manning

Reputation: 9450

Unfortunately, doing this isn't possible. While both CoreNLP (Java) and Stanza (formerly stanfordnlp, Python) do considerably overlapping things (part of speech, named entity recognition, parsing), their internals are completely different, dating from different decades. You cannot load Stanza models into CoreNLP, and there is at present no support for Persian in CoreNLP.
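For comparison, model selection in CoreNLP works through annotator properties and the official models jars (serialized .ser.gz / .tagger files on the classpath), not through Stanza-style *_model_path keys. Those properties are simply ignored, so the pipeline in the question falls back to the default English models, which is why the output differs. Below is a minimal sketch of how a supported non-English language is normally loaded in CoreNLP (assuming the stanford-corenlp-models-chinese jar is on the classpath); there is no equivalent properties file or models jar for Persian.

import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChinesePipelineExample {
    public static void main(String[] args) throws Exception {
        // load the language-specific defaults shipped inside the Chinese models jar;
        // the properties file and the model paths it references are resolved from the classpath
        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("这是一个测试句子。");
        pipeline.annotate(document);
        document.sentences().forEach(s -> System.out.println(s.tokensAsStrings()));
    }
}

If you need the Persian Stanza models from a Java application, one practical route is to keep the Stanza pipeline running as a separate Python process or small service and call it from Java, rather than trying to load the .pt files into CoreNLP.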

Upvotes: 2
