FrancoisTheFrenchOne

Reputation: 15

With which treebank are the available StanfordCoreNLP French models trained?

As the title says, I would like as much information as possible about the datasets used to train the Stanford CoreNLP French models available on this page (https://stanfordnlp.github.io/CoreNLP/history.html). My ultimate aim is to know the set of tags I can expect Stanford CoreNLP to output when analyzing a text written in French. I was told that a model is trained on a treebank. For French, there are six of them (http://universaldependencies.org/, section for the French language):

- FTB
- Original
- Sequoia
- ParTUT
- PUD
- Spoken

So I would like to know which of them was used to train which French model.

I first asked this question on the mailing list for Java NLP users ([email protected]), but I have not received an answer so far.

So, again, assuming one of the treebanks listed above was indeed used to train the Stanford CoreNLP French models available at the link above, which one is it? Alternatively, if no one here knows, who (first and last name) would know the answer?

Upvotes: 0

Views: 183

Answers (1)

StanfordNLPHelp

Reputation: 8739

For all who are curious about this, here is some info about the datasets used for French in Stanford CoreNLP:

French POS tagger: CC (Crabbe and Candito) modified French Treebank
French POS tagger (UD version): UD 1.3
French Constituency Parser: CC modified French Treebank
French NN Dependency Parser: UD 1.3

Also note that the French constituency parser cannot convert constituency parses into dependency parses the way the English constituency parser can.
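Since the underlying goal is to know which tags to expect, here is a minimal Python sketch that records the component-to-dataset mapping listed above and checks tagger output against a tag set. The names `FRENCH_MODEL_DATA`, `UD_V1_UPOS`, and `unexpected_tags` are hypothetical and not part of CoreNLP. The 17 tags are the universal POS tags from the UD v1 guidelines; that the UD 1.3 French tagger emits only these is an assumption worth verifying against real output. The tag set of the CC-modified French Treebank is different and is not listed here.

```python
# Dataset per French component, as listed in the answer above.
# The dictionary keys are hypothetical labels, not CoreNLP annotator names.
FRENCH_MODEL_DATA = {
    "pos_tagger": "CC (Crabbe and Candito) modified French Treebank",
    "pos_tagger_ud": "UD 1.3",
    "constituency_parser": "CC modified French Treebank",
    "nn_dependency_parser": "UD 1.3",
}

# Universal POS tags defined by the UD v1 guidelines.
# Assumption: the UD 1.3 French tagger emits only these tags.
UD_V1_UPOS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def unexpected_tags(tagged_tokens, tagset=UD_V1_UPOS):
    """Return, sorted, the tags in (word, tag) pairs that are not in tagset."""
    return sorted({tag for _, tag in tagged_tokens} - set(tagset))


# Hand-written sample standing in for tagger output.
sample = [("Le", "DET"), ("chat", "NOUN"), ("dort", "VERB"), (".", "PUNCT")]
print(unexpected_tags(sample))
```

Running `unexpected_tags` over the output of the UD tagger on a sample of your own text is a quick way to confirm the tag inventory empirically rather than relying on the documentation alone.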

Upvotes: 0
