Reputation: 785
I am currently using this JAR file for the Stanford NLP models: stanford-corenlp-3.5.2-models.jar
This file is pretty big: it's about 340 MB.
I am only using 4 annotators: tokenize, ssplit, parse, and lemma. Is there any way I can use a smaller model JAR file (or is there a JAR file for each individual model)? I absolutely need the size of this file to be as small as possible.
Upvotes: 2
Views: 1348
Reputation: 1
Following the advice of StanfordNLPHelp, I did this (I use Gradle):
Downloaded CoreNLP from the Stanford CoreNLP download page
Unjarred stanford-corenlp-X-models.jar
Went into /edu/stanford/nlp/models
Deleted the folders that are not relevant. Unfortunately this is a bit of guess-and-check
Rezipped the folder and converted it into a jar (I simply changed the extension, which might be a bit frowned upon)
Added a libs folder to my Gradle project: ./app/libs
Moved stanford-corenlp-X.jar from the download into that folder, along with the new jar made above
In build.gradle, added:
implementation files('libs/stanford-corenlp-4.4.0.jar')
implementation files('libs/stanford-corenlp-4.4.0-models.jar')
Ran gradle build. If there is an error, you deleted an important file; revert, rezip, and try again.
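The unjar/trim/rezip steps above can also be done programmatically. Since a jar is just a zip archive, plain `java.util.zip` is enough; this is a sketch in which the kept prefix and entry names are only examples, not the real contents of the models jar:

```java
import java.io.*;
import java.util.zip.*;

// Sketch: trim a models jar by copying only wanted entries into a new jar.
// The prefixes you need to keep depend on your annotators (guess-and-check,
// as described above).
public class JarTrimmer {
    // Keep an entry only if it lives under one of the given path prefixes.
    public static boolean keep(String name, String[] prefixes) {
        for (String p : prefixes) {
            if (name.startsWith(p)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Demo with an in-memory "jar": two example entries, one of which we drop.
        String[] keepPrefixes = { "edu/stanford/nlp/models/lexparser/" };

        ByteArrayOutputStream source = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(source)) {
            zos.putNextEntry(new ZipEntry("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("edu/stanford/nlp/models/ner/some-ner-model.ser.gz"));
            zos.closeEntry();
        }

        // Copy only the entries we want to keep into the trimmed jar.
        ByteArrayOutputStream trimmed = new ByteArrayOutputStream();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(source.toByteArray()));
             ZipOutputStream zos = new ZipOutputStream(trimmed)) {
            ZipEntry e;
            byte[] buf = new byte[8192];
            while ((e = zis.getNextEntry()) != null) {
                if (!keep(e.getName(), keepPrefixes)) continue;
                zos.putNextEntry(new ZipEntry(e.getName()));
                int n;
                while ((n = zis.read(buf)) > 0) zos.write(buf, 0, n);
                zos.closeEntry();
                System.out.println("kept: " + e.getName());
            }
        }
    }
}
```

For a real models jar you would read from and write to files instead of byte arrays; the filtering logic stays the same.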
Upvotes: 0
Reputation: 329
Following a similar approach to the one mentioned by @StanfordNLPHelp, I used the maven-shade-plugin and reduced the size of my final compiled jar file. You need to change "Package.MainClass" and the <include> tags, or add <exclude> tags:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- add Main-Class to the manifest file -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>Package.MainClass</mainClass>
          </transformer>
        </transformers>
        <minimizeJar>true</minimizeJar>
        <filters>
          <!-- keep all of the CoreNLP code itself -->
          <filter>
            <artifact>edu.stanford.nlp:stanford-corenlp</artifact>
            <includes>
              <include>**</include>
            </includes>
          </filter>
          <!-- from the models artifact, keep only the POS tagger models -->
          <filter>
            <artifact>edu.stanford.nlp:stanford-corenlp:models</artifact>
            <includes>
              <include>edu/stanford/nlp/models/pos-tagger/**</include>
            </includes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
Upvotes: 0
Reputation: 8739
You should be fine if you just include the parser's model file and the POS tagger's model file in your classpath. "lemma" requires "pos", so you will need to include that in your list of annotators.
For instance: "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" and "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" should be all you need.
You could just create that directory structure and include those files in your classpath, or make a jar with just those files in it. You can definitely cut out most of that jar.
The bottom line is that if you're missing something, your code will crash with a missing-resources error, so you simply need to keep adding files until it stops crashing. You definitely don't need most of the files in that jar.
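A minimal configuration consistent with this answer can be sketched like so. Only a plain `Properties` object is built here (the pipeline itself is not constructed); `annotators`, `parse.model`, and `pos.model` are standard CoreNLP property names, and the model paths are the ones quoted above:

```java
import java.util.Properties;

// Sketch of a minimal CoreNLP configuration: only the annotators the
// question asks for, plus "pos" (which "lemma" requires), pointed at
// the two model files kept on the classpath.
public class PipelineProps {
    public static Properties minimalProps() {
        Properties props = new Properties();
        // "lemma" depends on "pos", so pos must appear before lemma
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
        props.setProperty("parse.model",
            "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        props.setProperty("pos.model",
            "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
        return props;
    }

    public static void main(String[] args) {
        // Pass these to `new StanfordCoreNLP(props)` once the trimmed
        // models jar is on the classpath.
        System.out.println(minimalProps().getProperty("annotators"));
    }
}
```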
Upvotes: 3