Reputation: 122450
I'm using the Stanford NLP Parsing toolkit. Given a word in the lexicon, how can I find its frequency*? Or, given a frequency rank, how can I determine the corresponding word?
*in the entire language, not just the text sample.
This is a demo of the toolkit I'm using:
class ParserDemo {
public static void main(String[] args) {
LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
String[] sent = { "Sincerity", "may", "frighten", "the", "boy", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection tdl = gs.typedDependenciesCollapsed();
System.out.println(tdl);
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.printTree(parse);
}
}
Upvotes: 2
Views: 3560
Reputation: 859
This is an IR (information retrieval) problem more than NLP. One should look at libraries like Lucene for this task.
Upvotes: 0
Reputation: 15931
If you are only counting word frequencies, sentence parsing is unnecessary. All you need to do is tokenise the input and then count word frequencies using a java HashMap
. If you want to use the Stanford tools, then use any of the tokenisers in edu.stanford.nlp.process
.
This gives you the frequency of any given word, but in general it may not be possible to find the word corresponding to a given frequency rank, since some words may be equally frequent in the document.
Upvotes: 1