Reputation: 65
I use Stanford NLP to extract data from text. I know a little code and can work with this excellent library, but I need to build my project in Java, which I know better than Python. I found this code for using a custom model:
import stanfordnlp

config = {
    'processors': 'tokenize,mwt,pos,lemma,depparse',  # Comma-separated list of processors to use
    'lang': 'fa',  # Language code for the language to build the Pipeline in
    'tokenize_model_path': './PersianPT/fa_seraji_models/fa_seraji_tokenizer.pt',  # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    'mwt_model_path': './PersianPT/fa_seraji_models/fa_seraji_mwt_expander.pt',
    'pos_model_path': './PersianPT/fa_seraji_models/fa_seraji_tagger.pt',
    'pos_pretrain_path': './PersianPT/fa_seraji_models/fa_seraji.pretrain.pt',
    'lemma_model_path': './PersianPT/fa_seraji_models/fa_seraji_lemmatizer.pt',
    'depparse_model_path': './PersianPT/fa_seraji_models/fa_seraji_parser.pt',
    'depparse_pretrain_path': './PersianPT/fa_seraji_models/fa_seraji.pretrain.pt'
}

nlp = stanfordnlp.Pipeline(**config)  # Initialize the pipeline using a configuration dict
doc = nlp("من عاشقت هستم")  # Run the pipeline on input text

with open('./Desktop/NLP/out.txt', "w", encoding="utf-8") as f:
    for token in doc.sentences[0].tokens:
        f.write(token.words[0].text + '---Upos : ' + token.words[0].upos + '---Xpos : ' + token.words[0].xpos + '\n')

doc.sentences[0].print_tokens()
This works fine in Python, but when I try to implement the same thing in Java, I don't know why the output is not the same!
Java code:
import java.io.File;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class TextAnalyzer {

    public static String text = """
            در ابتدا، زندگینامهها به عنوان یک بخش از تاریخ با تمرکز بر یک فرد خاص، با اهمیت تاریخی در نظر گرفته شد. انواع مستقل زندگینامهنویسی با تمایز از تاریخ عمومی از قرن ۱۸ ام شروع شده و فرمهای معاصر آن به قرن بیستم میرسد.
            """;

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run
        //props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
        //props.setProperty("coref.algorithm", "neural");
        //props.setProperty("processors", "tokenize,mwt,pos,lemma,depparse");
        //props.setProperty("processors", "tokenize, mwt, lemma, pos, depparse");
        props.setProperty("annotators", "tokenize, ssplit, parse");
        //props.setProperty("lang", "fa");
        //props.setProperty("use_gpu", "true");
        props.setProperty("tokenize_model_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji_tokenizer.pt");
        props.setProperty("mwt_model_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji_mwt_expander.pt");
        props.setProperty("pos_model_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji_tagger.pt");
        props.setProperty("pos_pretrain_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji.pretrain.pt");
        props.setProperty("lemma_model_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji_lemmatizer.pt");
        props.setProperty("depparse_model_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji_parser.pt");
        props.setProperty("depparse_pretrain_path", BasicLocation.getBaseFileDirForNLPPersian() + File.separatorChar + "Seraji" + File.separatorChar + "fa_seraji.pretrain.pt");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        //CoreDocument document = new CoreDocument(text);
        Annotation document = new Annotation(text);
        // run all annotators on this text
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // get the parse tree for each sentence
            Tree parseTree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println(parseTree);
        }
    }
}
I think it may be because the processors or the *_model_path arguments do not exist in the Java library. If someone knows about this, please share. Thanks!
Upvotes: 1
Views: 428
Reputation: 9450
Unfortunately, doing this isn't possible. While both CoreNLP (Java) and Stanza (formerly stanfordnlp, Python) do considerably overlapping things (part-of-speech tagging, named entity recognition, parsing), their internals are completely different, dating from different decades. You cannot load Stanza models into CoreNLP, and there is at present no support for Persian in CoreNLP.
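For contrast, here is a minimal sketch of how model loading looks for a language CoreNLP does support, assuming the Chinese models jar (stanford-corenlp-models-chinese) is on the classpath. CoreNLP reads its own serialized models from such jars through a language-specific properties file; it has no loader for Stanza's .pt files, which is why the *_model_path properties in the question are effectively ignored.

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChineseParseExample {
    public static void main(String[] args) {
        // Loads StanfordCoreNLP-chinese.properties (annotator list and model paths)
        // from the Chinese models jar on the classpath.
        StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        CoreDocument document = new CoreDocument("我喜欢自然语言处理。");
        pipeline.annotate(document);
        for (CoreSentence sentence : document.sentences()) {
            // Constituency parse produced by CoreNLP's own Chinese parser model
            System.out.println(sentence.constituencyParse());
        }
    }
}

Since nothing comparable ships for Persian, one practical option is to keep the Stanza pipeline in Python and have the Java application consume its output (for example via a file or a small service), rather than trying to point CoreNLP at the fa_seraji .pt models.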
Upvotes: 2